Introduction


This book aims to get you up to speed with what is, in my opinion,
the most powerful data-visualisation stack going: Python and
JavaScript. You’ll learn enough of big libraries like Pandas and D3 to
start crafting your own web data-visualisations and refining your
own toolchain. Expertise will come with practice, but this book
presents a shallow learning curve to basic competence.




If you’re reading this in Early Release form I’d
love to hear any feedback you have. Please post
it to pyjsdataviz@kyrandale.com. Thanks a lot,
Kyran.


You’ll also find a working copy of the Nobel-
visualisation the book literally and figuratively
builds towards at
http://kyrandale.com/static/pyjsdataviz/index.html.


The bulk of this book tells one of the innumerable tales of
data-visualisation, one carefully selected to showcase some powerful
Python and JavaScript libraries or tools which together form a
toolchain. This toolchain gathers raw, unrefined data at its start and
delivers a rich, engaging web-visualisation at its end. Like all tales of
data-visualisation it is a tale of transformation, in this case
transforming a basic Wikipedia list of Nobel prize-winners into an
interactive visualisation, bringing the data to life and making
exploration of the prize’s history easy and fun.

A primary motivation for writing the book is the belief that,
whatever data you have, whatever story you want to tell with it, the
natural home for the visualizations you transform it into is the web.
As a delivery platform it is orders of magnitude more powerful than what
came before and this book aims to smooth the passage from desktop
or server-based data analysis and processing to getting the fruits of
that labour out on the web.

But the most ambitious aim of this book is to persuade you that 
working with these two powerful languages towards the goal of 
delivering powerful web-visualisations is actually fun and engaging. 

I think many potential data-viz programmers assume there is a big
divide, called Web Development, between them and what they would like
to do, which is program in Python and JavaScript. On this view, web-dev
involves loads of arcane knowledge about markup-languages, style-scripts,
administration etc. and can’t be done without tools with strange
names like Gulp or Yeoman. I aim to show that these days that big
divide can be collapsed to a thin and very permeable membrane,
allowing you to focus on what you do well, programming stuff (see
Figure P-1), with minimal effort, relegating the web-servers to
data-delivery.


[Figure: two panels labelled Perception and Reality]

Figure P-1. Here be web-dev dragons


Who This Book is For

First off, this book is for anyone with a reasonable grasp of Python 
or JavaScript who wants to explore one of the most exciting areas in 
the data-processing ecosystem right now, the exploding field of 
data-visualisation for the web. It’s also about addressing some
specific pain-points which in my experience are quite common.
When you get commissioned to write a technical book, chances are
your editor will sensibly caution you to think in terms of ‘pain
points’ that your book aims to address. The two key pain points of
this book are best illustrated by way of a couple of stories, one my
own, the other one that has been told to me in various guises by
JavaScripters I know.

Many years ago, as an academic researcher, I came across Python
and fell in love. I had been writing some fairly complex simulations
in C(++) and Python’s simplicity and power was a breath of fresh
air from all the boilerplate, Makefiles, declarations and definitions
and the like. Programming was fun, Python the perfect glue, playing
nicely with my C(++) libraries (Python wasn’t then and still isn’t a
speed demon) and doing, with consummate ease, all the stuff that in
low level languages is such a pain, e.g. file I/O, database access,
serialisation etc. I started to write all my graphical user interfaces
(GUIs) and visualisations in Python, using wxPython, PyQt and a
whole load of other refreshingly easy toolsets. Now there’s some stuff
there that I think is pretty cool but I doubt I’ll ever get around to the
necessary packaging, version checking and various other hurdles to
distribution, so no-one else will ever see it.

At the time there existed what in theory was the perfect universal
distribution system for the software I’d so lovingly crafted, namely
the web-browser. Available on pretty much every computer on
earth, with its own built-in, interpreted programming language:
write once, run everywhere. But everyone knew that (a) Python
doesn’t play in the web-browser’s sandpit and (b) browsers were
incapable of ambitious graphics and visualisations, being pretty much
limited to static images and the odd jQuery transformation.
JavaScript was a ‘toy’ language tied to a very slow interpreter, good for
little DOM tricks but certainly nothing approaching what I could do
on the desktop with Python. So that route was discounted, out of
hand. My visualisations wanted to be on the web but there was no
route through.

Fast forward a decade or so and, thanks to an arms race initiated by
Google and their V8 engine, JavaScript is now orders of magnitude
faster; in fact it’s now an awful lot faster than Python¹. HTML has
also tidied up its act a bit, in the guise of HTML5. It’s a lot nicer to
work with, with much less boilerplate. What were loosely followed
and distinctly shaky protocols like Scalable Vector Graphics (SVG)
have firmed up nicely thanks to powerful visualisation libraries, D3
being preeminent. Modern browsers are obliged to work nicely with
SVG and, increasingly, 3D in the form of WebGL and its children
such as THREE.js. Those visualisations I was doing in Python are
now possible on your local web-browser and the payoff is that, with
very little effort, they can be made accessible to every desktop,
laptop, smartphone and tablet in the world.


1 See here for a fairly jaw-dropping comparison.

So why aren’t Pythonistas flocking to get their data out there in a
form they dictate? After all, the alternative to crafting it yourself is
leaving it to somebody else, something most data-scientists I know
would find far from ideal. Well, first there’s that term Web
Development, connoting complicated markup, opaque stylesheets, a whole
slew of new tools to learn, IDEs to master. And then there’s
JavaScript itself, a strange language, thought of as little more than a toy
until recently and having something of the neither fish nor fowl to
it. I aim to take those pain-points head-on and show that you can
craft modern web-visualisations (often single page apps) with a very
minimal amount of HTML and CSS boilerplate, allowing you to
focus on the programming, and that JavaScript is an easy leap for
the Pythonista, having a lot in common. But you don’t have to leap:
Chapter 2 is a language-bridge, which aims to help Pythonistas and
JavaScripters bridge the divide between the languages by
highlighting common elements and providing simple translations.
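
To give a flavour of the kind of simple translation meant here, the following is a minimal sketch rather than an excerpt from Chapter 2. The sample records and variable names are invented for illustration, and the JavaScript equivalent is shown only as a comment for comparison.

```python
# Python: filter and transform a list with a comprehension.
# The records below are invented sample data.
winners = [
    {"name": "Marie Curie", "year": 1911},
    {"name": "Albert Einstein", "year": 1921},
    {"name": "Dorothy Hodgkin", "year": 1964},
]

# Keep winners before 1950 and upper-case their names.
early_names = [w["name"].upper() for w in winners if w["year"] < 1950]
print(early_names)

# A rough JavaScript equivalent, using the array methods filter and map:
#   const earlyNames = winners
#       .filter(w => w.year < 1950)
#       .map(w => w.name.toUpperCase());
```

The two versions have the same shape: a list of dictionaries in Python corresponds to an array of objects in JavaScript, and the comprehension corresponds to a filter followed by a map.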

The second story is a common one I run into among JavaScript
data-visualisers I know. Processing data in JavaScript is far from
ideal. There are few heavyweight libraries and although recent
functional enhancements to the language make data-munging much
more pleasant, there’s still no real data-processing ecosystem to
speak of. So there’s a distinct asymmetry between the hugely
powerful visualisation libraries available, D3 as ever paramount, and the
ability to clean and process any data delivered to the browser. All of
this mandates doing your data-cleaning, processing and exploration
in another language or with a toolkit like Tableau and this often
devolves into piecemeal forays into vaguely remembered Matlab, the
steepish learning curve that is R or a Java library or two.

Toolkits like Tableau, although very impressive, are often, in my
experience, ultimately frustrating for programmers. There’s no way
to replicate in a GUI the expressive power of a good, general
purpose programming language. Plus, what if you want to create a little
web-server to deliver your processed data? That means learning at
least one new web-dev capable language.

In other words, JavaScripters starting to stretch their data visualisa- 
tion are looking for a complementary data-processing stack which 
requires the least investment of time and has the shallowest learning 
curve. 

Minimal requirements to use the book 

I always feel reluctant placing restrictions on people’s explorations,
particularly in the context of programming and the web, which is
chock full of auto-didacts (how else would one learn, the halls of
academe being lightyears behind the trend?), learning fast and
furiously, gloriously uninhibited by the formal constraints that used to
apply to learning. Python and JavaScript are pretty much as simple
as it gets, programming language wise, and are both top candidates
for best first language. There isn’t a huge cognitive load in
interpreting the code.

In that spirit, there are expert programmers who, without any expe- 
rience of Python and JavaScript, could consume this book and be 
writing custom libraries within a week. These are also the people 
most likely to ignore anything I write here, so good luck to you
people if you decide to make the effort.

For beginner programmers, fresh to Python or JavaScript, this book
is probably too advanced for you and I’d recommend taking
advantage of the plethora of books, web-resources, screencasts and the
like that make learning so easy these days. Focus on a personal itch,
a problem you want to solve and learn to program by doing - it’s the
only way.

For people who have programmed a bit in either Python or
JavaScript, my advised threshold to entry is that you have used a few
libraries together, understand the basic idioms of your language and
can look at a piece of novel code and generally get a hook on what’s
going on, i.e. Pythonistas who can use a few modules of the Standard
library and JavaScripters who can not only use jQuery but
understand some of its source-code.
Why Python and JavaScript?

Why JavaScript is an easy question to answer. For now and the
foreseeable future there is only one first class, browser-based
programming language. There have been various attempts to extend,
augment and usurp but good old plain vanilla JS is still preeminent.
If you want to craft modern, dynamic, interactive visualisations and,
at the touch of a button, deliver them to the world, at some point
you are going to run into JavaScript. You might not need to be a zen
master but basic competence is a fundamental price of entry into
one of the most exciting areas of modern data science. This book
hopes to get you into the ballpark.

Why not Python on the browser? 

There are currently some very impressive initiatives aimed at
enabling Python produced visualisations, often built on Matplotlib, to
run in the browser. This is achieved by converting the Python code
into JavaScript based on the canvas or svg drawing contexts. The
most popular and mature of these are Bokeh and the recently
open-sourced Plotly (see also IPython’s Jupyter project). While these are
both brilliant initiatives, I feel that in order to do web-based data-viz
you have to bite the JavaScript bullet to exploit the increasing potential
of the medium. That’s why, along with space constraints, I’m not
covering the Python to JavaScript dataviz converters.

While there is some brilliant coding behind these JavaScript
converters and many solid use-cases, they do have big limitations:

• Automated code-conversion may well do the job but the code 
produced is usually pretty impenetrable for a human being. 

• Adapting and customising the resulting plots using the power- 
ful browser-based JavaScript development environment is likely 
to be very painful. 

• You are limited to the subset of plot types currently available in 
the libraries. 

• Interactivity is very basic at the moment. Stitching this together
is better done in JavaScript, using the browser’s dev-tools.

Bear in mind that the people building these libraries have to be
JavaScript experts, so if you want to understand anything of what
they’re doing and eventually express yourself then you’ll have to get
up to scratch with some JavaScript.

My basic take-home with Python-to-JavaScript conversion is that it
has its place but would only be generally justified were JavaScript
ten times harder to program than it is. The fiddly, iterative process
of creating a modern browser-based data-visualisation is hard
enough using a first-class language without having to negotiate an
indirect journey through a second-class one.

Why Python for data-processing 

Why you should choose Python for your data-processing needs is a
little more involved. For a start there are good alternatives as far as
data-processing is concerned. Let’s deal with a few candidates for the
job, starting with the enterprise behemoth Java:

Java 

Among the other main, general-purpose programming languages,
only Java offers anything like the rich ecosystem of libraries that
Python does, with considerably more native speed too. But while
Java is a lot easier to program in than, say, C++, it isn’t, in my
opinion, a particularly nice language to program in, having rather too
much in the way of tedious boilerplate and excessive verbiage. This
sort of thing starts to weigh after a while and makes for a hard slog
at the code-face. As for speed, Python’s default interpreter is slow
but Python is a great glue language, which plays nicely with other
languages. This ability is demonstrated by the big Python
data-processing libraries like Numpy (and its dependent Pandas), Scipy
and the like, which use C(++) and Fortran libraries to do the heavy
lifting while providing the ease of use of a simple scripting
language.

R 

The venerable R has, until recently, been the tool of choice for many
data-scientists and is probably Python’s main competitor in the
space. Like Python, R benefits from a very active community, some
great tools like the plotting library ggplot and a syntax specially
crafted for data-science and statistics. But this specialism is a
double-edged sword. Because R was developed for a specific purpose, it
means that if, for example, you wish to write a web-server to serve
your R processed data, you have to skip out to another language,
with all the attendant learning overheads, or try and hack something
together, round hole, square-peg wise. Python’s general purpose
nature and its rich ecosystem mean one can do pretty much
everything required of a data-processing pipeline (JS visuals aside)
without having to leave its comfort zone. Personally, I find a little
syntactic clunkiness a small sacrifice to make for that.

Others 

There are other alternatives to doing your data-processing with 
Python but none of them come close to the flexibility and power 
afforded by a general purpose, easy to use programming language 
with a rich ecosystem of libraries. While, for example, mathematical 
programming environments such as Matlab and Mathematica have 
active communities and a plethora of great libraries, they hardly
count as general purpose, being designed to be used within a closed
garden. They are also proprietary, which means a significant initial
investment and a different vibe to Python’s resoundingly
open-source environment.

GUI-driven data-viz tools like Tableau are great creations but will
quickly frustrate someone used to the freedom of programming.
They tend to work great as long as you are singing from their song- 
sheet, as it were. Deviations from the designated path get painful 
very quickly. 

Python's getting better all the time 

As things stand, I think a very good case can be made for Python
being the budding data-scientist’s language of choice. But things are
not standing still; in fact Python’s capabilities in this area are
growing at an astonishing rate. To put it in perspective, I have been
programming in Python for over fifteen years and have grown used to
being surprised if I can’t find a Python module to help solve a
problem at hand, but I find myself surprised at the growth of Python’s
data-processing abilities, with a new, powerful library appearing
weekly. To give an example, Python has traditionally been weak on
statistical analysis libraries, with R being far ahead here. Recently a
number of powerful modules, such as Statsmodels, have started to
close this gap fast.
So Python is a thriving data-processing ecosystem with pretty much
unmatched general-purpose capabilities and it’s getting better week on
week. It’s understandable why so many in the community are in a state
of such excitement - it’s pretty exhilarating.

As far as visualisation in the browser goes, the good news is that
there’s more to JavaScript than its privileged, nay exclusive place in the
web-ecosystem. Thanks to an interpreter arms race which has seen
performance increase in staggering leaps and bounds and some
powerful visualisation libraries, such as D3, which would
complement any language out there, JavaScript now has serious chops.

In short, Python and JavaScript are a wonderful complement for 
data-visualisation on the web, each needing the other to provide a 
vital missing component. 

What You'll Learn

There are some big Python and JavaScript libraries in our dataviz 
toolchain and comprehensive coverage of them all would require a
number of books. Nevertheless I think that the fundamentals of 
most libraries, and certainly the ones covered here, can be grasped 
fairly quickly. Expertise takes time and practice but the basic knowl- 
edge needed to be productive is, so to speak, low-hanging fruit. 

In that sense this book aims to give you a solid backbone of practical 
knowledge, strong enough to take the weight of future development. 
I aim to make the learning curve as shallow as possible and get you 
over the initial climb, with the practical skills needed to start refin- 
ing your art. 

This book emphasises pragmatism and best-practice. It’s going to
cover a fair amount of ground and there isn’t enough space for too
many theoretical diversions. I will aim to cover those aspects of the
libraries in the toolchain that are most commonly used, and point
you to resources for the other stuff. Most libraries have a hard-core
of functions, methods, classes and the like that are the chief,
functional subset. With these at your disposal you can actually do stuff.
Eventually you find an itch you can’t scratch with those, at which
time good books, documentation and online forums are your friend.
The Choice of Libraries

I had three things in mind while choosing the libraries used in the 
book. 

1. Open-source and free as in beer - you shouldn’t have to invest
any extra money to learn with this book.

2. Longevity - generally well-established, community-driven and
popular.

3. Best-of-breed, allowing for 2. - at the sweet-spot between popu- 
larity and utility. 

The skills you learn here should be relevant for a long time. Gener- 
ally the obvious candidates have been chosen, libraries that write 
their own ticket as it were. Where appropriate I will highlight the
alternative choices and give a rationale for my selection. 

Preliminaries 

A few preliminary chapters are needed before beginning the
transformative journey of our Nobel data-set through the toolchain.
These aim to cover the basic skills necessary to make the rest of the
toolchain chapters run more fluidly. The first few chapters cover the
following:

• Building a Language bridge between Python and JavaScript. 

• Covering the Basic Web-dev needed by the book. 

• How to pass data around with Python, through various file- 
formats and databases. 

These chapters are part-tutorial, part-reference and it’s fine to skip
straight to the beginning of the toolchain, dipping back where 
needed. 

The Dataviz Toolchain 

The main part of the book demonstrates the data-visualisation
toolchain, which follows the journey of a data-set (of Nobel-prize
winners) from raw, freshly scraped data to engaging, interactive
JavaScript visualisation. During the process of collection, refinement
and transformation a number of big libraries are demonstrated,
summarised in Figure P-2. These libraries are the industrial lathes of
our toolchain, rich, mature tools which demonstrate the power of
the Python + JavaScript dataviz stack. Here’s a brief introduction to
the five stages of our toolchain and their major libraries.


[Figure: the five toolchain stages, taking the Wikipedia Nobel-page
(html) to an interactive Nobel-visualisation via database/files:
1. SCRAPE (Scrapy); 2. CLEAN (Pandas); 3. EXPLORE/PROCESS (IPython +
Pandas + Matplotlib); 4. DELIVER (Flask RESTful API); 5. TRANSFORM (D3)]

Figure P-2. The Data-viz Toolchain


1. Scraping data with Scrapy 


The first challenge for any data-visualiser is getting hold of the data
they need, whether by request or to scratch a personal itch. If you’re
very lucky this will be delivered to you in pristine form but more
often than not you have to go find it. I’ll cover the various ways you
can use Python to get data off the web (e.g. web-APIs or
google-spreadsheets) but the Nobel-prize data-set for the toolchain
demonstration is scraped² from its Wikipedia pages using Scrapy.

Python’s Scrapy is an industrial strength scraper that does all the
data-throttling, media pipelining etc. that are indispensable if you
plan on scraping significant amounts of data. Scraping is often the
only way to get the data you are interested in and once you’ve
mastered Scrapy’s workflow all those previously off-limits datasets are
only a spider³ away.


2 Web scraping is a computer software technique to extract information from websites,
usually involving getting and parsing web-pages.

3 Scrapy’s controllers are called spiders. 










2. Cleaning data with Pandas 

The dirty secret of data-viz is that pretty much all data is dirty and 
turning it into something you can use may well occupy a lot more 
time than anticipated. This is an unglamorous process which can
easily steal over half your time. Which is all the more reason to get 
good at it and use the right tools. 

Pandas is a huge player in the Python data-processing ecosystem. It’s
a Python data-analysis library whose chief component is the
DataFrame, essentially a programmatic spreadsheet. Pandas extends
Numpy, Python’s powerful numeric library, into the realm of
heterogeneous data-sets, the kind of categorical, temporal, ordinal etc.
information that data-visualisers have to deal with. As well as being
great for interactively exploring your data (using its built-in
Matplotlib plots), Pandas is well-suited to the drudge-work of cleaning
data, making it easy to locate duplicate records, fix dodgy
date-strings, find missing fields etc.
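
As a rough illustration of the cleaning jobs just listed, here is a minimal sketch using a tiny, invented DataFrame. The column names and values are assumptions made up for the example (one date and one country are deliberately broken); they are not the book's Nobel data-set.

```python
import pandas as pd

# Invented sample records, standing in for freshly scraped data.
df = pd.DataFrame({
    "name": ["Marie Curie", "Marie Curie", "Albert Einstein", "Dorothy Hodgkin"],
    "born": ["1867-11-07", "1867-11-07", "1879-03-14", "not known"],
    "country": ["Poland", "Poland", "Germany", None],
})

# Locate duplicate records, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Fix dodgy date-strings: values that cannot be parsed become NaT
# (Pandas' missing-date marker) instead of raising an error.
df["born"] = pd.to_datetime(df["born"], errors="coerce")

# Find missing fields, counted column by column.
missing = df.isnull().sum()
print(n_duplicates)
print(missing)
```

Coercing unparseable dates to missing values is one choice among several; depending on the data you might prefer to inspect and repair the offending strings by hand.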

3. Exploring data with Pandas and Matplotlib

Before beginning the transformation of your data into a visualisa- 
tion you need to understand it. The patterns, trends, anomalies etc. 
hidden in the data will inform the stories you are trying to tell with
it, whether that’s explaining a recent rise in year-on-year widget sales
or demonstrating global climate change. 

In conjunction with IPython, the Python interpreter on steroids,
Pandas and Matplotlib (with additions such as Seaborn) provide a
great way to explore your data interactively, generating rich, inline 
plots from the command-line, slicing and dicing your data to reveal 
interesting patterns. The results of these explorations can then be 
easily saved to file or database to be passed on to your JavaScript vis- 
ualisation. 
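
The slice, dice and save workflow described above might look something like the following minimal sketch. The records are invented sample data, plotting is left out to keep the example short, and JSON is just one convenient hand-over format for JavaScript.

```python
import json

import pandas as pd

# Invented sample records for illustration.
df = pd.DataFrame({
    "name": ["Marie Curie", "Marie Curie", "Albert Einstein", "Dorothy Hodgkin"],
    "category": ["Physics", "Chemistry", "Physics", "Chemistry"],
    "year": [1903, 1911, 1921, 1964],
})

# Slice: keep only the prizes awarded before 1920.
early = df[df["year"] < 1920]

# Dice: count the prizes in each category.
counts = df.groupby("category").size()
print(counts)

# Save: serialise the rows as a JSON array of objects, a shape that
# JavaScript libraries can consume directly.
records_json = df.to_json(orient="records")
records = json.loads(records_json)  # round-trip to check the structure
print(records[0])
```

In practice the JSON string would be written to a file or stored in a database rather than parsed straight back in.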

4. Delivering your data with Flask 

Once you’ve explored and refined your data you’ll need to serve it to
the web-browser, where a JavaScript library like D3 can transform it.
One of the great strengths of using a general purpose language like
Python is that it’s as comfortable rolling a web-server in a few lines
of code as it is crunching through large datasets with
special-purpose libraries like Numpy and Scipy⁴. Flask is Python’s most
popular lightweight server and is perfect for creating small, RESTful⁵
APIs which can be used by JavaScript to get data from the server, in
files or databases, to the browser. As I’ll demonstrate, you can roll a
RESTful API in a few lines of code, capable of delivering data from
SQL or NoSQL databases.
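
To make the few-lines-of-code claim concrete, here is a minimal sketch of a read-only API in plain Flask. The route names and the in-memory WINNERS list are assumptions invented for this example; a real API would read from files or a database, and the book's own implementation may differ.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Invented in-memory records, standing in for a database query or file read.
WINNERS = [
    {"name": "Marie Curie", "category": "Chemistry", "year": 1911},
    {"name": "Albert Einstein", "category": "Physics", "year": 1921},
]


@app.route("/api/winners")
def get_winners():
    """Return every record as a JSON array."""
    return jsonify(WINNERS)


@app.route("/api/winners/<int:year>")
def get_winners_by_year(year):
    """Return only the records for the given year."""
    return jsonify([w for w in WINNERS if w["year"] == year])


# Exercise the API with Flask's built-in test client, without starting a server.
client = app.test_client()
response = client.get("/api/winners/1921")
print(response.get_json())
```

To serve the data to a browser for real you would start the development server, for example with app.run(), and fetch the same URLs from JavaScript.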

5. Transforming the data into interactive visualisations
with D3

Once the data is cleaned and refined, we have the visualisation 
phase, where selected reflections of the data-set are presented, ide- 
ally allowing the user to explore them interactively. Depending on 
the data this might involve barcharts, maps or novel visualisations. 

D3 is JavaScript’s powerhouse visualisation library, arguably one of
the most powerful visualisation tools irrespective of language. We’ll
use D3 to create a novel Nobel-prize visualisation with multiple
ments and user interaction, allowing people to explore the dataset 
for items of interest. D3 can be challenging to learn but I hope to 
bring you quickly up to speed, ready to start honing your skills in 
the doing. 

Smaller Libraries 

As well as the big libraries covered, there is a large supporting cast of 
smaller libraries. These are the indispensable smaller tools, the ham- 
mers, spanners etc. of the toolchain. Python in particular has an 
incredibly rich ecosystem, with small, specialised libraries for almost 
every conceivable job. Among the strong supporting cast, some par- 
ticularly deserving of mention are: 

• requests: Python’s go-to HTTP library, fully deserving its
motto ‘HTTP for humans’. requests is far superior to urllib2,
one of Python’s included batteries.


4 The Scientific Python library, part of the Numpy ecosystem. 

5 REST is short for Representational State Transfer, the dominant style for HTTP-based 
web-APIs and much recommended. 
• SQLAlchemy: the best Python SQL Toolkit and Object Relational
Mapper (ORM) there is. It’s feature rich and makes working
with the various SQL-based databases a relative breeze.

• Seaborn: a great addition to Python’s plotting powerhouse
Matplotlib, adding some very useful plot-types including some
statistical ones of particular use to data-visualisers. It also adds
arguably superior aesthetics, over-riding the Matplotlib defaults.

• crossfilter: although JavaScript’s data-processing libraries are
a work-in-progress, a few really useful ones have emerged
recently with crossfilter being a stand-out. It enables very
fast filtering of row-columnar datasets and is ideally suited to
data-viz work, unsurprising as one of its creators is Mike
Bostock, the father of D3.

A Little Bit of Context

This is a practical book and assumes that the reader has a pretty 
good idea what he or she wants to visualise, how that visualization 
should look and feel and a desire to get cracking on, unencumbered 
by too much theory. Nevertheless, drawing on the history of data- 
visualisation can both clarify the central themes of the book and add
valuable context. It can also help explain why now is such an
exciting time to be entering the field, as technological innovation is
driving
ing novel data-viz forms and people are grappling with the problem 
of presenting the increasing amount of multi-dimensional data gen- 
erated by the internet. 

Data visualisation has an impressive body of theory behind it and
there are some great books out there that I would recommend you
read (see ??? on page 21 for a little selection). The practical benefit of
understanding the way humans visually harvest information cannot
be overstated. It can be easily demonstrated, for example, that a
pie-chart is almost always a bad way of presenting comparative data and
a simple bar-chart far preferable. By conducting psychometric
experiments, we now have a pretty good idea how to trick the
human visual system and make the relationships in the data harder
to grasp. Conversely we can show that some visual forms are close to
optimal for amplifying contrast. The literature, at its very least,
provides some useful rules of thumb which suggest good candidates for
any particular data narrative.
In essence good data-viz tries to present data, collected from
measurements in the world (empirical) or maybe the product of abstract
mathematical explorations (e.g. the beautiful fractal patterns of the
Mandelbrot set), in such a way as to draw out or emphasise any
patterns or trends that might exist. These patterns can be simple, e.g.
average weight by country, or the product of sophisticated statistical
analysis, e.g. data-clustering in a higher dimensional space.




Figure P-3. (a) An early spreadsheet (b) Joseph Minard’s visualisation
of Napoleon’s Russian campaign of 1812

In its untransformed state, we can imagine this data floating as a
nebulous cloud of numbers or categories. Any patterns or
correlations are entirely obscure. It’s easy to forget but the humble
spreadsheet (Figure P-3 a.) is a data-visualisation, the ordering of data into
row-columnar form an attempt to tame it, make its manipulation
easier and highlight discrepancies (e.g. actuarial book-keeping etc.).
Of course, most people are not adept at spotting patterns in rows of
numbers, so more accessible, visual forms were developed to engage
with our visual-cortex, the prime human conduit for information
about the world. Enter the bar-chart, pie-chart⁶, line-chart etc. More
imaginative ways were employed to distil statistical data in a more
accessible form, one of the most famous being Charles Joseph
Minard’s visualisation of Napoleon’s disastrous Russian campaign of
1812 (Figure P-3 b.).

The tan colored stream in Figure P-3 b. shows the advance of
Napoleon’s army on Moscow, the black line the retreat. The thickness of
the stream represents the size of Napoleon’s army, thinning as
casualties mounted. A temperature chart below is used to indicate the
temperature at locations along the way. Note the elegant way in
which Minard has combined multi-dimensional data (casualty
statistics, geographical location and temperature) to give an impression
of the carnage which would be hard to grasp in any other way
(imagine trying to jump from a chart of casualties to a list of
locations and make the necessary connections). I would argue that the
chief problem of modern interactive data-viz is exactly that faced by
Minard: how to move beyond conventional one-dimensional
bar-charts etc. (perfectly good for many things) and develop new ways to
communicate cross-dimensional patterns effectively.

Until quite recently, most of our experience of charts was not much 
different from those of Charles Minard’s audience. They were
pre-
rendered, inert and showed one reflection of the data, hopefully an 
important and insightful one but nevertheless under total control of 
the author. In this sense the replacement of real ink-points with 
computer screen pixels was only a change in the scale of distribu- 
tion. 

The leap to the internet just replaced newsprint with pixels, the
visualisation still being unclickable and static. Recently the combination
of some powerful visualisation libraries (D3 preeminent among
them) and a massive improvement in JavaScript’s performance have
opened the way to a new type of visualisation, one that is easily
accessible, dynamic and actually encourages exploration and
discovery. The clear distinction between data exploration and presentation
is blurred. This new type of data visualisation is the focus of this
book and the reason why data-viz for the web is such an exciting
area right now. People are trying to create new ways to visualise data
and make it more accessible/useful to the end-user. This is nothing
short of a revolution.


6 William Playfair’s Statistical Breviary of 1801 having the dubious distinction of origin

Summary 

Dataviz on the web is an exciting place to be right now with innova- 
tions in interactive visualisations coming thick and fast, many (if not 
most) of them being developed with D3. JavaScript is the only 
browser-based language so the cool visuals are by necessity being 
coded in it (or converted into it). But JavaScript lacks the tools or 
environment necessary for the less dramatic but just as vital element 
of modern data-viz: the aggregation, curation and processing of the
data. This is where Python rules the roost, providing a general- 
purpose, concise and eminently readable programming language 
with access to an increasing stable of first-class data-processing 
tools. Many of these tools leverage the power of very fast, low-level 
libraries making Python data-processing fast as well as easy. 

This book aims to introduce some of those heavy-weight tools as 
well as a host of other smaller but equally vital tools. It also aims to 
show how Python and JavaScript in concert represent the best
data-viz stack out there for anyone wishing to deliver their visualisations
to the internet. 

Up next is the first part of the book covering the preliminary skills 
needed for the toolchain. You can work through it now or skip 
ahead to part two and the start of the toolchain, referring back when
needed.

Recommended Books 

Here are a few key data-visualisation books to whet your appetite,
covering the gamut from interactive dashboards to beautiful and
insightful info-graphics.

• The Visual Display of Quantitative Information. Edward Tufte. 
Graphics Press, 1983. 

• Information Visualization: Perception for Design. Colin Ware. 
Morgan Kaufmann, 2004. 

• Cartographies of Time: A History of the Timeline. Daniel
Rosenberg. Princeton Architectural Press, 2012.

• Information Dashboard Design: Displaying Data for At-a-Glance
Monitoring. Stephen Few. Analytics Press, 2013.

• The Functional Art. Alberto Cairo. New Riders, 2012.

• Semiology of Graphics: Diagrams, Networks, Maps. Jacques
Bertin. Esri Press, 2010.

CHAPTER 1


A Development Setup 


This chapter will cover the downloads and software installations
needed to use this book as well as sketching out a recommended
development environment. As you’ll see, this isn’t as onerous as it
might once have been. I’ll cover Python and JavaScript
dependencies separately and give a brief overview of cross-language IDEs.

Python 

The bulk of the libraries covered in the book are Python-based, but
what might have been a challenging attempt to provide
comprehensive installation instructions for the various operating systems and
their quirks is made much easier by the existence of Continuum
Analytics’ Anaconda, a Python platform which bundles together
most of the popular analytics libraries in a convenient package.

Anaconda 

Installing some of the bigger Python libraries used to be a challenge
all in itself, particularly those such as Numpy which depend on
complex low-level C and Fortran packages. That’s why the existence
of Anaconda is such a godsend. It does all the dependency
checking, binary installs etc. so you don’t have to. It’s also a very
convenient resource for a book like this.


Python 2 or 3? 

Right now Python is in transition to version 3, a process which is
taking longer than many would like. This is because Python 2+
works fine for many people, a lot of code will have to be converted¹
and up until recently some of the big libraries, such as Numpy and
Scipy, only worked for 2.

Now that most of the major libraries are Python 3 compatible it
would be a no-brainer to recommend that version for this book.
Unfortunately one of the few hold-outs, not yet v3 ready, is Scrapy,
a big tool on our tool-chain² which you’ll learn about in Chapter 6.
I don’t want to oblige you to run two versions so for that reason
we’ll be using the version 2 Anaconda package.

I will be using the new print function³, which means all the
non-Scrapy code will work fine with Python 3.


To get your free Anaconda install, just navigate your browser to
https://www.continuum.io/downloads, choose the version for your
operating system (as of late 2015, we’re going with Python 2.7),
and follow the instructions. Windows and OS X get a graphical
installer (just download and double-click) while Linux requires you
to run a little bash script:

$ bash Anaconda-2.3.0-Linux-x86_64.sh

I recommend sticking to defaults when installing Anaconda.

Checking the Anaconda install 

The best way to check your Anaconda install went ok is to try firing 
up an IPython session at the command-line. How to do this depends 
on your operating system: 


1 There are a number of pretty reliable automatic converters out there. 

2 The Scrapy team are working hard to rectify this. Scrapy relies on Python’s Twisted, an
event-driven networking engine also making the journey to Python 3+ compatibility.

3 This is imported from the __future__ module, i.e. from __future__ import print_function.

At the Windows command-prompt:

C:\Users\Kyran>ipython

At the OS X or Linux prompt:

$ ipython

This should produce something like the following: 

kyran@Tweedledum:~/projects/pyjsbook$ ipython
Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03)
Type "copyright", "credits" or "license" for more information.

IPython 3.2.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org


Most installation problems will stem from a badly configured
environment PATH variable. This PATH needs to contain the location of
the main Anaconda directory and its Scripts sub-directory. In
Windows this should look something like:

...C:\Anaconda;C:\Anaconda\Scripts...

You can access and adjust the environment variables in Windows 7
by typing “environment variables” in the programs search field and
selecting Edit environment variables, or in XP via Control
Panel > System > Advanced > Environment Variables.

In OS X and Linux systems you should be able to set your PATH
variable explicitly by appending this line to the .bashrc file in your
home directory:

export PATH=/home/kyran/anaconda/bin:$PATH
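
If IPython still refuses to start, it can help to look at the PATH from Python itself. The following is only a minimal sketch: the helper name path_entries and the ~/anaconda/bin location are assumptions made for illustration, so substitute the directory your own installer reported.

```python
import os


def path_entries(path_string=None):
    """Split a PATH-style string into its individual directories.

    Defaults to the PATH of the current process.
    """
    if path_string is None:
        path_string = os.environ.get('PATH', '')
    # os.pathsep is ':' on OS X and Linux, ';' on Windows
    return [p for p in path_string.split(os.pathsep) if p]


# Hypothetical install location; replace with your Anaconda directory
anaconda_bin = os.path.join(os.path.expanduser('~'), 'anaconda', 'bin')

for entry in path_entries():
    print(entry)

print('Anaconda on PATH: %s' % (anaconda_bin in path_entries()))
```

If the final line reports False, the export line (or the Windows environment variable) has probably not been picked up by the shell you are using.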

Installing extra libs 

Anaconda contains almost all the Python libraries covered in this
book (see here for the full list of Anaconda libraries). Where we
need a non-Anaconda library we can use pip (short for Pip Installs
Python), the de facto standard for installing Python libraries. Using
pip to install is as easy as can be: just call pip install followed by
the name of the package from the command-line and it should be
installed or, with any luck, give a sensible error:

$ pip install dataset 
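
Before reaching for pip you may want to check whether a library is already importable. This is a small sketch rather than an official pip feature: is_installed is a made-up helper and the second module name is deliberately fictitious.

```python
import importlib


def is_installed(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# 'json' ships with Python; the second name is invented for the demo
for name in ['json', 'surely_not_a_real_module_xyz']:
    if is_installed(name):
        print('%s: available' % name)
    else:
        print('%s: missing, try pip install' % name)
```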
Virtual Environments


http://docs.python-guide.org/en/latest/dev/virtualenvs/ [Virtual Environments] pro

Anaconda comes with a conda system command that makes
creating and using virtual environments easy. Let’s create a special one
for this book, based on the full Anaconda package:

$ conda create --name pyjsviz anaconda
#
# To activate this environment, use:
# $ source activate pyjsviz
#
# To deactivate this environment, use:
# $ source deactivate
#

As the final message says, to use this virtual environment you need
only source activate it (for Windows machines you can leave out
the source):

$ source activate pyjsviz
discarding /home/kyran/anaconda/bin from PATH
prepending /home/kyran/.conda/envs/pyjsviz/bin to PATH
(pyjsviz) $

Note that you get a helpful cue at the command-line to let you know
which virtual environment you’re using.
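
The same question can be asked from inside Python. The sketch below assumes, as is my understanding of conda’s activate script, that it sets a CONDA_DEFAULT_ENV environment variable; if your conda version behaves differently, sys.executable and sys.prefix are still reliable guides.

```python
import os
import sys

# The interpreter binary that is actually running this code
print('Interpreter: %s' % sys.executable)

# Root directory of the active Python installation or environment
print('Prefix:      %s' % sys.prefix)

# Assumed to be set by conda on activation; absent otherwise
env_name = os.environ.get('CONDA_DEFAULT_ENV', '(none active)')
print('Conda env:   %s' % env_name)
```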

The conda command can do a lot more than just facilitate virtual
environments, combining the functionality of Python’s pip installer
and virtualenv command, among other things. You can get a full
run-down here.

JavaScript 

The good news is that you don’t need much JavaScript software at
all. The only must-have is the Chrome/Chromium web-browser,
which is used in this book. It offers the most powerful set of
developer tools of any current browser and is cross-platform.

To download Chrome just go here and download the version for 
your operating system. This should be automatically detected. 

If you want something slightly less Googlefied then you can use
Chromium, the browser based on the open-source project from
which Google Chrome is derived. You can find up-to-date
instructions on installation here or just head on to the main download
page. Chromium tends to lag Chrome feature-wise but is still an
eminently usable development browser.

Content Delivery Networks (CDNs) 

One of the reasons you don’t have to worry about installing
JavaScript libraries is that the ones used in this book are available via
content delivery networks (CDNs). Rather than having the libraries
installed on your local machine, the JavaScript is retrieved by the
browser over the web, from the closest available server. This should
make things very fast - faster than if you served the content yourself.

To include a library via CDN you use the usual <script> tag,
usually placed at the bottom of your HTML page. For example, this call adds
the latest (as of late 2015) version of D3:

<script
  src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"
  charset="utf-8">
</script>

Installing libraries locally 

If you need to install JavaScript libraries locally, e.g. you anticipate
doing some offline development work or can’t guarantee an internet
connection, there are a number of fairly simple ways.

You can just download the separate libraries and put them in your
local server’s static folder. This is a typical folder-structure.
Third-party libraries go in the static/libs directory off root, like so:

nobel_viz/
└── static
    ├── css
    ├── data
    ├── libs
    │   └── d3.min.js
    └── js

If you organise things this way, to use D3 in your scripts now
requires a local file reference with the <script> tag:

<script src="/static/libs/d3.min.js"></script>

Databases

This book shows how to interact with the main SQL databases and
MongoDB, the chief non-relational or NoSQL database, from
Python. We’ll be using SQLite, the brilliant file-based SQL database.
Here are the download details for SQLite and MongoDB:

• SQLite - a great, file-based, serverless SQL database. It should
come as standard with Mac OS X and Linux. For Windows,
follow this guide.

• MongoDB - by a long way the most popular NoSQL database.
Installation instructions here.

Note that we’ll either be using Python’s SQLAlchemy SQL library
directly or through libraries that build on it. This means any SQLite
examples can be converted to another SQL backend (e.g. MySQL or
PostgreSQL) by changing a configuration line or two.
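
To get a feel for how little setup a serverless database needs, here is a minimal sketch using the sqlite3 module from Python’s standard library. Note the substitution: this is not SQLAlchemy, it is used here only because it needs no installation. The table and column names are invented for illustration.

```python
import sqlite3

# ':memory:' creates a throwaway database in RAM; pass a filename
# such as 'nobel.db' (hypothetical) to persist it to disk instead
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE winners (name TEXT, category TEXT, year INTEGER)')
conn.executemany(
    'INSERT INTO winners VALUES (?, ?, ?)',
    [('Marie Curie', 'Physics', 1903),
     ('Albert Einstein', 'Physics', 1921),
     ('Marie Curie', 'Chemistry', 1911)])
conn.commit()

# Bound parameters (?) are safer than building SQL with string formatting
rows = conn.execute(
    'SELECT name, year FROM winners WHERE category = ? ORDER BY year',
    ('Physics',)).fetchall()

for name, year in rows:
    print('%s (%d)' % (name, year))

conn.close()
```

There is no server process to start or stop here, which is what makes SQLite so convenient during development.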

Integrated Development Environments 

As I explain in “The myth of IDEs, frameworks and tools” on page
108, I don’t think you need an IDE to program in Python or
JavaScript. The development tools provided by modern browsers,
Chrome in particular, mean you only really need to add a good
code-editor to have pretty much the optimal setup. It’s free as in beer too.

For Python, I have tried a few IDEs but they’ve never stuck. The
main itch I was trying to scratch was a decent debugging system.
Setting breakpoints etc. in Python with a text-editor isn’t particularly
elegant and using the command-line debugger pdb feels a little too
old-school sometimes. Nevertheless, Python’s logging etc. is so easy
and effective that breakpoints were an edge-case which didn’t justify
leaving my favourite editor⁴, which does pretty decent
code-completion, solid syntax-highlighting etc.

In no particular order, here are a few I’ve tried and not disliked:

• PyCharm - solid code-assistance etc., good debugging.

• PyDev - if you like Eclipse and can tolerate its rather large
footprint, this might well be for you.

• WingIDE - a solid bet, with a great debugger and incremental
improvements over a decade and a half of development.


4 Emacs with VIM key-bindings.

Summary 

With free, packaged Python distributions such as Anaconda and the
inclusion of sophisticated JavaScript development tools in freely
available web-browsers, the necessary Python and JavaScript elements of
your dev environment are a couple of clicks away. Add a favourite
editor and a database of choice¹ and you are pretty much good to go.
There are additional libraries such as node.js which can be useful
but don’t count as essential. Now we’ve established our
programming environment, the next chapters will teach the preliminaries
needed to start our journey of data-transformation along the
toolchain, starting with a language bridge between Python and
JavaScript.


1 SQLite is great for development purposes and doesn’t need a server running on your
machine.







PART I


A Basic Toolkit


This first part of the book provides a basic toolkit for the toolchain to 
come and is part tutorial, part reference. Given the fairly wide range of 
knowledge in the book’s target audience, there will probably be things
covered that you already know. My advice is just to cherry-pick the 
material to fill any gaps in your knowledge and maybe skim what you 
already know as a refresher. 

If you’re confident you already have the basic toolkit to hand, feel
free to skip to the start of our journey along the toolchain in Part II.

CHAPTER 2


A Language Learning Bridge 
Between Python and JavaScript 


Probably the most ambitious aspect of this book is that it deals with 
two programming languages. Moreover, it only requires that you are 
competent in one of these languages. This is only possible because 
Python and JavaScript (JS) are fairly simple languages with much in 
common. The aim of this chapter is to draw out those
commonalities and use them to make a learning-bridge between the two
languages such that core skills acquired in one can easily be applied to
the other.

After showing the key similarities and differences between the two 
languages I’ll show how to set up a learning environment for Python
and JS. The bulk of the chapter will then deal with core syntactical 
and conceptual differences, followed by a selection of patterns and 
idioms that I use a lot while doing data visualisation work. 

Similarities and differences 

Syntax differences aside, Python and JavaScript actually have a lot in
common. After a short while, switching between them can be
almost seamless¹. Let’s compare the two from a data-visualiser’s
perspective:

These are the chief similarities:

• They both work without needing a compilation step (i.e. they
are interpreted).

• You can use both with an interactive interpreter, which means
you can type in lines of code and see the results right away.

• Both have garbage collection.

• Neither language has header files, package boilerplate etc.

• Both are primarily developed with a text-editor, not an IDE.

• In both, functions are first-class citizens which can be passed as
arguments etc.
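
The last point in the list above is easy to try out. Here is a minimal Python sketch of functions being treated as ordinary values; all the names (apply_to_all, double, make_scaler) are invented for this illustration.

```python
def apply_to_all(func, items):
    """Call func on every item and return the results as a list."""
    return [func(item) for item in items]


def double(x):
    return x * 2


# Functions can be passed as arguments...
doubled = apply_to_all(double, [3, 7, 2])

# ...stored in data structures...
operations = {'double': double, 'negate': lambda x: -x}
negated = apply_to_all(operations['negate'], [3, 7, 2])


# ...and returned from other functions
def make_scaler(factor):
    def scale(x):
        return x * factor
    return scale


tripled = apply_to_all(make_scaler(3), [3, 7, 2])

print('%s %s %s' % (doubled, negated, tripled))
```

The JavaScript equivalent looks much the same, with function expressions taking the place of def and lambda.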

Their key differences:

• Possibly the biggest difference is that JavaScript is
single-threaded and non-blocking, using asynchronous I/O. This
means simple things like file-access involve the use of a callback
function.

• JS is used essentially in web-development, until very recently
being browser-bound², but Python is used almost everywhere.

• JS is the only first-class language in web-browsers, Python being
excluded.

• Python has a comprehensive standard library whereas JS has a
limited set of utility objects, e.g. JSON, Math.

• Python has fairly classical object-oriented classes whereas JS
uses prototypes.

• JS lacks general-purpose data-processing libs³.

The differences here emphasise the need for this book to be
bilingual. JavaScript’s monopoly of browser dataviz needs the
complement of a conventional data-processing stack. And Python has the
best there is.


1 One particularly annoying little gotcha is that while Python uses pop to remove a list
item, it uses append not push to add an item. JavaScript uses push to add an item while
concat is used to concatenate arrays.

2 The ascent of node.js has extended JavaScript to the server.

3 This is changing with libraries like crossfilter but JS is far behind Python, R and others.

Interacting with the Code 

One of the great advantages of Python and JavaScript is that because 
they are interpreted on-the-fly, you can interact with them. Python’s
interpreter can be run from the command-line while JavaScript’s is
generally accessed from the web-browser through a console, usually
available from the in-built development tools. In this section we’ll
see how to fire up a session with the interpreter and start trying out 
your code. 

Python 

By far the best Python interpreter is IPython, which comes in three
shades: the basic terminal version, an enhanced graphical version
and a notebook. The notebook is a wonderful and fairly recent
innovation, providing a browser-based interactive computational
environment. There are pros and cons to the different versions. The
command-line is fastest to scratch a problematic itch but lacks some
bells and whistles, particularly embedded plotting courtesy of
Matplotlib and friends. This makes it sub-optimal for Pandas-based
data-processing and data-visualisation work. Of the other two, both are
better for multi-line coding (trying out functions etc.) than the basic
interpreter but I find the graphical qtconsole more intuitive, having a
familiar command-line rather than executable cells⁴. The great
boon of the notebook is session persistence and the possibility of
web-access⁵. The ease with which one can share programming
sessions, complete with embedded data-visualisations, makes the
notebook a fantastic teaching tool, as well as a great way to recover
programming context.

You can start them at the command-line like this:

$ ipython [qtconsole | notebook]

The option can be empty, for the basic command-line interpreter,
qtconsole for a Qt-based graphical version and notebook for the
browser-based notebook. You can use any of the three IPython alternatives
for this section but for serious interactive data-processing I generally
find myself gravitating to the Qt console for sketches or the
notebook if I anticipate an extensive project.


4 This version is based on the Qt GUI library.

5 At the cost of running a Python interpreter on the server.

JavaScript 

There are lots of options for trying out JavaScript code without 
starting a server, though the latter isn’t that difficult. Because the 
JavaScript interpreter comes embedded in all modern web-browsers, 
there are a number of sites that let you try out bits of JavaScript 
along with HTML and CSS and see the results. JSBin is a good 
option. These sites are great for sharing code, trying out snippets 
etc. and usually allow you to add libraries such as D3.js.

If you want to try out code one-liners or quiz the state of live code, 
browser-based consoles are your best bet. With Chrome you can 
access the console with the key-combo Ctrl-Shift-J. As well as 
trying little JS snippets, the console allows you to drill-down into 
any objects in scope, revealing their methods and properties. This is 
a great way to quiz the state of a live object and search for bugs. 

One disadvantage of using on-line JavaScript editors is losing the 
power of your favourite editing environment, with linting, familiar 
keyboard-shortcuts and the like (see Chapter 4). On-line editors 
tend to be rudimentary, to say the least. If you anticipate an
extensive JavaScript session and want to use your favourite editor, the best
bet is to run a local server. 

First, create a project directory, called sandpit for example, and add
a minimal HTML file which includes a JS script:

sandpit
├── index.html
└── script.js

The index.html file need only be a few lines long, with an optional 
div place-holder on which to start building your visualisation or just 
trying out a little DOM-manipulation. 

<!-- index.html -->
<!DOCTYPE html>
<meta charset="utf-8">

<div id='viz'></div>

<script type="text/javascript" src="script.js" async></script>

You can then add a little JavaScript to your script.js file:

// script.js
var data = [3, 7, 2, 9, 1, 11];
var total = 0;

// forEach returns undefined, so accumulate into total directly
data.forEach(function(d){
    total += d;
});

console.log('Sum = ' + total);
// outputs 'Sum = 33'

Start your development server in the project directory:

sandpit$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
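
SimpleHTTPServer is the Python 2 module name; under Python 3 the equivalent command is python -m http.server. The sketch below does the same job programmatically and assumes Python 3.7 or later (for the directory argument). The temporary directory and one-line index.html are stand-ins for your own sandpit project.

```python
import functools
import http.client
import http.server
import os
import tempfile
import threading

# Stand-in for the sandpit directory, with a one-line index.html
sandpit = tempfile.mkdtemp()
with open(os.path.join(sandpit, 'index.html'), 'w') as f:
    f.write("<div id='viz'></div>")

# Serve files from sandpit rather than the current working directory
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=sandpit)

# Port 0 asks the operating system for any free port
server = http.server.HTTPServer(('127.0.0.1', 0), handler)
port = server.server_address[1]
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

# Request the page, much as the browser would
conn = http.client.HTTPConnection('127.0.0.1', port)
conn.request('GET', '/index.html')
body = conn.getresponse().read().decode('utf-8')
conn.close()
print(body)

server.shutdown()
server.server_close()
```

For day-to-day work the one-line command is all you need; the programmatic form is mainly useful in tests or scripts.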

Then open your browser at http://localhost:8000, press Ctrl-Shift-J 
(Cmd + Opt + J on a Mac) to access the console and you should see 
Figure 2-1, showing the logged output of the script (see Chapter 4 
for further details). 


[Figure: Chrome developer tools with the Console tab selected, showing the logged output 'Sum = 33']

Figure 2-1. Outputting to the Chrome console

Now we’ve established how to run the demo code, let’s start building
a bridge between Python and JavaScript. First, we’ll cover the basic
differences in syntax. As you’ll see, they’re fairly minor and easily
absorbed.

Basic Bridge Work 

In this section I’ll contrast the basic nuts and bolts of programming
in the two languages. 

Style guidelines, PEP 8 and 'use strict'

Where JavaScript style guidelines are a bit of a free-for-all (with
people often defaulting to those used by a big library like jQuery),
Python has a Python Enhancement Proposal (PEP) dedicated to it.
I’d recommend getting acquainted with PEP-8 but not submitting
totally to its leadership. It’s right about most things but there’s room
for some personal choice here. There’s a handy on-line checker here
which will pick up any infractions of PEP-8.

In Python you should use four spaces to indent a code-block.
JavaScript is less strict but two spaces is the most common indent.

One recent addition to JavaScript (Ecmascript 5) is the 'use strict'
directive, which imposes strict mode. This mode enforces some
good JavaScript practice, which includes catching accidental global
declarations, and I thoroughly recommend its use. To use it just
place the string at the top of your function or module:

(function(foo){
  'use strict';
  // ...
}(window.foo = window.foo || {}));

Camel-case vs underscore 

JS conventionally uses camel-case (e.g. processStudentData) for its 
variables while Python, in accordance with PEP-8, uses underscores 
(e.g. process_student_data) in its variable names (Example 2-4 and 
Example 2-3 B). By convention (and convention is more important 
in the Python ecosystem than JS) Python uses capitalised camel-case 
for class declarations (see below), uppercase for constants and 
underscores for everything else: 

FOO_CONST = 10

class FooBar(object): # ...

def foo_bar():
    baz_bar = 'some string'

Importing modules, including Scripts 

Using other libraries in your code, either your own or third party, is 
fundamental to modern programming. Which makes it all the more 
surprising that JavaScript doesn’t really have a mechanism for doing
it⁶. Python has a simple import system which, on the whole, works
pretty well. 


6 The constraint of having to deliver JS scripts over the web via HTTP is largely
responsible for this.

The good news on the JavaScript front is that Ecmascript 6, the next
version of the language, does address this issue, with the addition of
import and export statements. Ecmascript 6 will be getting browser
support soon but as of late 2015 you need a converter to Ecmascript
5 such as Babel.js. Meanwhile, although there have been many
attempts to create a reasonable client-side modular system, none
have really achieved critical mass and all are a little awkward to use.
For now I would recommend using the well-established HTML
script tag to include scripts. So to include the D3 visualisation
library you would add this tag to your main HTML file,
conventionally index.html:

<!DOCTYPE html>
<meta charset="utf-8">
<script src="http://d3js.org/d3.v3.min.js"></script>

You can include the script anywhere in your HTML file but it’s best
practice to add them after the body (div tags etc.) section⁷. Note that
the order of the script tags is important. If a script is dependent on a
module, e.g. it uses the D3 library, its script tag must be placed
after that of the module, i.e. big library scripts, such as jQuery and
D3, will be included first.

Python comes with ‘batteries included’, a comprehensive set of
libraries covering everything from extended data containers
(collections) to working with the family of CSV files (csv). If you want to
use one of these just import it using the import keyword:

In [1]: import sys 

In [2]: sys.platform 
Out[2]: 'linux2'

If you don’t want to import the whole library, want to use an alias
etc., you can use the as and from keywords instead:

import pandas as pd
from csv import DictWriter, DictReader
from numpy import *  # ❶

df = pd.read_json('data.json')
# DictReader needs an open file rather than a filename
reader = DictReader(open('data.csv'))
md = median([12, 56, 44, 33])


❶ This imports all the variables from the module into the current
namespace and is almost always a bad idea. One of the variables
could mask an existing one and it goes against the Python
best-practice of explicit being better than implicit. One exception to
this rule is if you are using the Python interpreter interactively.
In this limited context it may make sense to import all functions
from a library to cut down on key-presses, e.g. importing all the
math functions (from math import *) if doing some Python
math hacking.


7 This means any blocking script loading calls occur after the page’s HTML has rendered.
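
Here is a self-contained sketch of the alias and from forms, using only standard-library modules so that it runs without any data files. It assumes Python 3; the CSV contents are invented and the alias coll is just a label chosen for the example.

```python
import collections as coll
from csv import DictReader
from io import StringIO

# An in-memory stand-in for a CSV file on disk (contents invented)
csv_file = StringIO('name,score\nBob,68\nAlice,75\nBob,81\n')

# DictReader takes an open file (or any iterable of lines) and
# yields one dictionary per row, keyed by the header line
rows = list(DictReader(csv_file))

# The collections module is reached through its alias
counts = coll.Counter(row['name'] for row in rows)
print(counts.most_common(1))
```

Because DictReader was imported by name it is used bare, whereas Counter has to be reached through the alias, which keeps its origin visible at the point of use.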

If you import a non-standard library, Python uses sys.path to try
and find it. sys.path consists of:

• the directory containing the importing module (current
directory)

• the PYTHONPATH variable, containing a list of directories

• the installation-dependent default, where libraries installed
using pip or easy_install will usually be placed.

Big libraries are often packaged, being divided into sub-modules. 
These sub-modules are accessed by dot-notation: 

import matplotlib.pyplot as plt 

Packages are constructed from the filesystem using __init__.py files,
usually empty, as shown in Example 2-1. The presence of an __init__.py file
makes the directory visible to Python’s import system.

Example 2-1. Building a Python package

mypackage
├── __init__.py
├── core
│   └── __init__.py
└── io
    ├── __init__.py
    ├── api.py
    └── tests
        ├── __init__.py
        ├── test_data.py
        └── test_excel.py ❶

❶ This module would be imported using from
mypackage.io.tests import test_excel.

Packages on sys.path can be accessed from the root directory
(that’s mypackage in Example 2-1) using dot-notation. A special case
of import is intra-package references. The test_excel.py submodule
in Example 2-1 can import submodules from the mypackage
package both absolutely and relatively:

from mypackage.io.tests import test_data  # ❶
from . import test_data  # ❷
import test_data  # ❷
from .. import api  # ❸

❶ Imports the test_data.py module absolutely, from the package’s
head-directory.

❷ An explicit (‘. import’) and implicit relative import.

❸ A relative import of api, a sibling of the tests package.
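
To see the package machinery at work, the sketch below builds a cut-down version of the Example 2-1 layout in a temporary directory and then imports from it. It assumes Python 3; the SCORES variable written into test_data.py is invented for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Build the package skeleton in a temporary directory
root = tempfile.mkdtemp()
tests_dir = os.path.join(root, 'mypackage', 'io', 'tests')
os.makedirs(tests_dir)

# An empty __init__.py marks each directory as a package
for pkg_dir in ['mypackage',
                os.path.join('mypackage', 'io'),
                os.path.join('mypackage', 'io', 'tests')]:
    open(os.path.join(root, pkg_dir, '__init__.py'), 'w').close()

# A stand-in test_data module holding one invented variable
with open(os.path.join(tests_dir, 'test_data.py'), 'w') as f:
    f.write('SCORES = [68, 75, 56, 81]\n')

# Put the directory containing mypackage on the import search path
sys.path.insert(0, root)
importlib.invalidate_caches()

from mypackage.io.tests import test_data

print(test_data.__name__)
print(test_data.SCORES)
```

Adding the parent directory to sys.path by hand is only for the sake of the demo; a package installed with pip, or sitting in the current directory, is found without it.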

Keeping your namespaces clean 

The variables defined in Python modules are encapsulated, which
means that unless you import them explicitly, e.g. from foo import
baa, you will be accessing them from the imported module’s
namespace using dot notation, e.g. foo.baa. This modularisation of the
global namespace is quite rightly seen as a very good thing and plays
to one of Python’s key tenets, the importance of explicit
statements over implicit. When analysing someone’s Python code it
should be possible to see exactly where a class, function or variable
has come from. Just as importantly, preserving the namespace limits
the chance of conflicting or masking variables - a big potential
problem as code-bases get larger.
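
A short sketch makes the point concrete, using the standard math module. The redefinition of sqrt is deliberately bad practice, shown only to illustrate masking.

```python
import math
from math import sqrt

# Dot notation makes the origin of pi obvious to the reader
circumference = 2 * math.pi * 1.0

# Assigning to a local name leaves the module's namespace untouched
pi = 3
print(math.pi == pi)


# A name imported with 'from' lives in the current namespace, so a
# later definition silently masks it
def sqrt(x):
    return 'masked!'


print(sqrt(16))       # our function, not math's
print(math.sqrt(16))  # the original is still reachable via its module
```

The dotted form costs a few extra key-presses but means a name can never be replaced behind your back.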

One of the main criticisms of JavaScript, and a fair one, is that it
plays fast and loose with namespace conventions. The most
egregious example of this is that variables declared outside of functions
or missing the ‘var’ keyword⁸ are global rather than confined to the
script in which they are declared. There are various ways to rectify
this situation but the one I use and recommend is to make each of
your scripts a self-calling function. This makes all variables declared
using var local to the script/function, preventing them polluting the
global namespace. Any objects, functions, variables etc. you want to
make available to other scripts can be attached to an object which is
part of the global namespace.

Example 2-2 demonstrates a module-pattern. The boilerplate head
and tail (labelled ❶ and ❸) effectively create an encapsulated
module. This pattern is far from a perfect solution to modular JavaScript
but is the best compromise I know until Ecmascript 6 and a
dedicated import system become standard. One obvious disadvantage is
that the module is part of the global namespace, which means,
unlike in Python, there is no need to explicitly import it.

Example 2-2. A module pattern for JavaScript

(function(nbviz) {  // ❶
    'use strict';
    // ...
    nbviz.updateTimeChart = function(data) {  // ❷
        // ...
    };
    // ...
}(window.nbviz = window.nbviz || {}));  // ❸

❶ Receives the global nbviz object.

❷ Attaches the updateTimeChart method to the global nbviz
object, effectively exporting it.

❸ If an nbviz object exists in the global (window) namespace pass
it into the module function, otherwise add it to the global
namespace.

Outputting 'Hello World'

By far the most popular initial demonstration of any programming
language is getting it to print or communicate 'Hello World!' in
some form, so let's start with getting output from Python and
JavaScript.

8 This possibility of a missing 'var' can be removed by using the
Ecmascript 5 'use strict' directive.

Python's output couldn't be much simpler but version 3 sees a
change to the print statement, making it a proper function⁹:

    # In Python 2
    print 'Hello World!'

    # In Python 3
    print('Hello World!')

You can use Python 3's print function in Python 2 by importing it
from the __future__ module:

    from __future__ import print_function

If you're not using Python 3 then this is a sensible approach. The
new print function is here to stay and it's best to get used to it now.

JavaScript has no print function but you can log output to the
browser console:

    console.log('Hello World!');

Simple data-processing

A good way to get an overview of the language differences is to see
the same function written in both. Example 2-3 and Example 2-4
show a small, contrived example of data-munging in Python and
JavaScript respectively. We'll use these to compare Python and JS
syntax.


Example 2-3. Simple data-munging with Python

    from __future__ import print_function

    # A
    student_data = [
        {'name': 'Bob', 'id': 0, 'scores': [68, 75, 56, 81]},
        {'name': 'Alice', 'id': 1, 'scores': [75, 90, 64, 88]},
        {'name': 'Carol', 'id': 2, 'scores': [59, 74, 71, 68]},
        {'name': 'Dan', 'id': 3, 'scores': [64, 58, 53, 62]},
    ]

    # B
    def process_student_data(data, pass_threshold=60,
                             merit_threshold=75):
        """ Perform some basic stats on some student data. """

        # C
        for sdata in data:
            av = sum(sdata['scores'])/float(len(sdata['scores']))
            sdata['average'] = av

            if av > merit_threshold:
                sdata['assessment'] = 'passed with merit'
            elif av > pass_threshold:
                sdata['assessment'] = 'passed'
            else:
                sdata['assessment'] = 'failed'

            # D
            print("%s's (id: %d) final assessment is: %s"%(
                sdata['name'], sdata['id'], sdata['assessment'].upper()))

    # E
    if __name__ == '__main__':
        process_student_data(student_data)

9 This is a good thing for reasons outlined in PEP 3105.

Example 2-4. Simple data-munging with JavaScript

    // A (note the deliberate and valid inconsistency in keys:
    // some quoted and some unquoted)
    var studentData = [
        {name: 'Bob', id:0, 'scores': [68, 75, 76, 81]},
        {name: 'Alice', id:1, 'scores': [75, 90, 64, 88]},
        {'name': 'Carol', id:2, 'scores': [59, 74, 71, 68]},
        {'name': 'Dan', id:3, 'scores': [64, 58, 53, 62]},
    ];

    // B
    function processStudentData(data, passThreshold, meritThreshold){
        passThreshold = typeof passThreshold !== 'undefined'? passThreshold: 60;
        meritThreshold = typeof meritThreshold !== 'undefined'? meritThreshold: 75;

        // C
        data.forEach(function(sdata){
            var av = sdata.scores.reduce(function(prev, current){
                return prev+current;
            },0) / sdata.scores.length;
            sdata.average = av;

            if(av > meritThreshold){
                sdata.assessment = 'passed with merit';
            }
            else if(av > passThreshold){
                sdata.assessment = 'passed';
            }
            else{
                sdata.assessment = 'failed';
            }

            // D
            console.log(sdata.name + "'s (id: " + sdata.id +
                ") final assessment is: " +
                sdata.assessment.toUpperCase());
        });
    }

    // E
    processStudentData(studentData);

String construction

Section D in Example 2-3 and Example 2-4 shows the standard way
to print output to console or terminal. JavaScript has no print
statement but will log to the browser's console through the console
object:

    console.log(sdata.name + "'s (id: " + sdata.id +
        ") final assessment is: " + sdata.assessment.toUpperCase());

Note that the integer variable id is coerced to a string, allowing
concatenation. Python doesn't perform this implicit coercion so
attempting to add a string to an integer in this way would give an
error. Instead, explicit conversion to string form is achieved using
one of the str or repr functions.

In Section D of Example 2-3 the output string is constructed using
C-style formatting. String (%s) and integer (%d) place-holders are
filled from a final tuple (%(...)):

    print("%s's (id: %d) final assessment is: %s"
          %(sdata['name'], sdata['id'], sdata['assessment'].upper()))
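As a rough sketch of the two points above, the following snippet shows Python refusing to concatenate a string and an integer without explicit conversion, then builds the same message with C-style formatting and with the str.format method. The sdata record is a made-up stand-in for a single student entry, not part of the book's examples.

```python
# Hypothetical single student record, for illustration only.
sdata = {'name': 'Bob', 'id': 0, 'assessment': 'passed'}

# Python does not coerce the integer id to a string: this raises TypeError.
try:
    message = sdata['name'] + "'s id: " + sdata['id']
except TypeError:
    message = None

# Explicit conversion with str() makes concatenation work.
concatenated = sdata['name'] + "'s (id: " + str(sdata['id']) + ")"

# C-style formatting, as used in Example 2-3.
c_style = "%s's (id: %d) final assessment is: %s" % (
    sdata['name'], sdata['id'], sdata['assessment'].upper())

# The str.format method gives the same result with named fields.
formatted = "{name}'s (id: {id}) final assessment is: {grade}".format(
    name=sdata['name'], id=sdata['id'], grade=sdata['assessment'].upper())

print(c_style)
```

Both formatting styles produce the same string here; which one to prefer is largely a matter of taste and of the Python versions you need to support.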

These days I rarely use Python's print statement, opting for the
much more powerful and flexible logging module, which is demonstrated
in the following code-block. It takes a little more effort to use
but it is worth it. Logging gives you the flexibility to direct output to
a file and/or the screen, adjust the logging level to prioritise certain
information and a whole load of other useful things. Check out the
details in the Python documentation for the logging module.

    import logging

    logger = logging.getLogger(__name__) # (1)
    #...
    logger.debug('Some useful debugging output')
    logger.info('Some general information')

    # IN INITIAL MODULE
    logging.basicConfig(level=logging.DEBUG) # (2)

(1) Creates a logger with the name of this module.

(2) You can set the logging level, an output file as opposed to the
    default of the screen etc.
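To make the effect of the logging level concrete, here is a minimal, self-contained sketch. It assumes a made-up logger name (nbviz_demo) and sends output to an in-memory stream rather than a file, purely so that the example leaves nothing behind on disk.

```python
import io
import logging

# Hypothetical logger name; in a real module you would pass __name__.
logger = logging.getLogger('nbviz_demo')
logger.setLevel(logging.INFO)   # messages below INFO are dropped
logger.propagate = False        # keep this demo independent of the root logger

# Send output to an in-memory stream so the example creates no file;
# logging.FileHandler('app.log') would write to a file instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
logger.addHandler(handler)

logger.debug('Some useful debugging output')   # filtered out at INFO level
logger.info('Some general information')        # recorded

output = stream.getvalue()
print(output)
```

Lowering the level to logging.DEBUG would let the first message through as well.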

Significant whitespace vs curly brackets

The syntactic feature most associated with Python is significant
whitespace. Whereas languages like C and JavaScript use whitespace
for readability and could easily be condensed into one line¹⁰, in
Python leading spaces are used to indicate code-blocks and removing
them changes the meaning of the code. The extra effort required
to maintain correct code alignment is more than compensated for
by increased readability: you spend far longer reading than writing
code and the easy reading of Python is probably the main reason
why the Python library ecosystem is so healthy. Four spaces is pretty
much mandatory (see PEP 8) and my personal preference is for
what is known as soft tabs, where your editor inserts (and deletes)
multiple spaces instead of a tab character¹¹.

In the following code, the indentation of the return statement must
be four spaces by convention¹².

    def doubler(x):
        return x * 2
    #   |<-this spacing is important

JavaScript doesn't care about the number of spaces between
statements, variables etc., using curly brackets to demarcate
code-blocks. The two doubler functions in this code are equivalent:

    var doubler = function(x){
        return x * 2;
    }

    var doubler=function(x){return x*2;}

10 This is actually done by JavaScript compressors to reduce the
filesize of downloaded web-pages.

11 The soft vs hard tab debate generates controversy, with much heat
and little light. PEP 8 stipulates spaces, which is good enough for me.

12 It could be two or even three spaces but this number must be
consistent throughout the module.

Much is made of Python's whitespace but most good coders I know
set their editors up to enforce indented code-blocks and a consistent
look and feel. Python merely enforces this good practice. And, to
reiterate, I believe the extreme readability of Python code contributes
as much to Python's supremely healthy ecosystem as its simple
syntax.

Comments and doc-strings

To add comments to code, Python uses hashes #:

    # ex.py, a single informative comment
    data = {} # Our main data-ball

By contrast, JavaScript uses the C language convention of double
forward slashes // or /* ... */ for multi-line comments:

    // script.js, a single informative comment
    /* A multi-line comment block for
    function descriptions, library script
    headers and the like */
    var data = {}; // Our main data-ball

As well as comments, and in keeping with its philosophy of
readability and transparency, Python has documentation strings
(docstrings) by convention. The process_student_data function in
Example 2-3 has a triple-quoted line of text at its top which will
automatically be assigned to the function's __doc__ attribute. You
can also use multi-line docstrings.

    def doubler(x):
        """This function returns double its input."""
        return 2 * x

    def sanitize_string(s):
        """This function replaces any string spaces
        with '-' after stripping any white-space
        """
        return s.strip().replace(' ', '-')

Docstrings are a great habit to get into, particularly if working
collaboratively. They are understood by most decent Python editing
toolsets and are also used by such automated documentation libraries
as Sphinx. The string-literal docstring is accessible as the __doc__
property of a function or class.
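A short sketch of reading a docstring back at runtime, reusing the doubler function from above:

```python
def doubler(x):
    """This function returns double its input."""
    return 2 * x

# The docstring is stored on the function object itself.
doc = doubler.__doc__
print(doc)

# help(doubler) would print the same text in an interactive session.
```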

Declaring variables, var

In Section A of Example 2-3 and Example 2-4 the declaration of the
student data requires a var keyword for JavaScript. We could
dispense with the var and the script would run fine but we would be in
danger of being skewered by JS gotcha number one: any variables
declared without var are attached to the global namespace, or window
object, which means they can easily mask or be masked by any
other variables sharing the same name. This possibility of namespace
pollution is a big problem for JS and the reason why you
should get a good linter to warn of missing vars. You should also
use Ecmascript's 'use strict' directive to force all variables to be
declared with var (see “Style guidelines, PEP 8 and ‘use strict’” on
page 37).

Strictly speaking, JS statements should be terminated with a
semi-colon as opposed to Python's new line. You will see examples where
the semi-colon is dispensed with and modern browsers will usually
do the right thing here but there are risks involved (e.g. it can trip up
code minifiers and compressors which remove white-space). I'm in
the semi-colon camp but many smart people seem to make do
without them.



Declare all variables to be used in a function at
its top. JavaScript has variable hoisting, which
means variable declarations are processed before
any other code. This means declaring them anywhere
in the function is equivalent to declaring them at
the top. This can result in weird errors and
confusion. Explicitly placing vars at the top avoids
this.

Strings and numbers

The name strings used in the student-data (see Section A of
Example 2-3 and Example 2-4) will be interpreted as UCS-2 (the
parent of Unicode UTF-16) in JavaScript¹³, a string of bytes in
Python 2 and Unicode (UTF-8 by default) in Python 3¹⁴.

Both languages allow single and double quotes for strings. If you
want to include a single or double quote in the string then enclose
with the alternative, like so:

    pub_name = "The Brewer's Tap"

The scores in Section A of Example 2-4 are stored as JavaScript's one
numeric type, double-precision 64-bit (IEEE 754) floating-point
numbers. Although JavaScript has a parseInt conversion function,
when used with floats¹⁵ it is really just a rounding operator, similar
to floor. The type of the parsed number is still number:

    var x = parseInt(3.45); // 'cast' x to 3
    typeof(x); // "number"

Python has three numeric types: the 32-bit int, to which the student
scores are cast, a float equivalent (IEEE 754) to JS's number and a
long for arbitrary-precision integer arithmetic (Python 3 merges int
and long into a single arbitrary-precision int). This means Python
can represent any integer whereas JavaScript is more limited¹⁶.
Python's casting changes type:

    foo = 3.4 # type(foo) -> float
    bar = int(3.4) # type(bar) -> int

The nice thing about Python and JavaScript numbers is that they are
easy to work with and usually do what you want. If you need something
more efficient, Python has the Numpy library which allows
fine-grained control of your numeric types (you'll learn more about
Numpy in ???). In JavaScript, aside from some cutting edge projects,
you're pretty much stuck with 64-bit floats.
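The integer limitation is easy to see from the Python side, because Python's float is the same IEEE 754 double that JavaScript uses for every number. The sketch below is only an illustration of that shared float behaviour, not JavaScript itself.

```python
# Python integers are arbitrary precision, so this sum is exact.
big = 2 ** 53
exact_sum = big + 1

# A 64-bit float (the same IEEE 754 type as a JavaScript number) has a
# 53-bit significand, so it cannot distinguish 2**53 from 2**53 + 1.
as_float = float(big) + 1.0

ints_differ = exact_sum != big
floats_collide = as_float == float(big)
print(exact_sum, as_float)
```

Above 2**53 consecutive integers can no longer all be represented as doubles, which is the kind of weirdness footnote 16 alludes to.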


13 The quite fair assumption that JavaScript uses UTF-16 has been the
cause of much bug-driven misery. See here for an interesting analysis.

14 The change to Unicode strings in Python 3 is a big one. Given the
confusion that often attends Unicode de/encoding it's worth reading a
little bit about it: https://docs.python.org/3/howto/unicode.html

15 parseInt can do quite a bit more than round. For example
parseInt('12.5px') gives 12, parsing the leading digits of the string
and ignoring the rest. It also has a second radix argument to specify
the base of the cast. See here for the specifics.

16 With very large numbers JavaScript can get very weird, with non-continuous integer. 
Booleans

Python differs from JavaScript and the C-class languages in using
named boolean operators. Other than that they work pretty much
as expected. This table gives a comparison:

    Python      bool     True  False  not  and  or
    JavaScript  boolean  true  false  !    &&   ||

Python's capitalised True and False is an obvious trip-up for any
JavaScripter and vice-versa but any decent syntax-highlighting should
catch that, as should your code-linter.

Rather than always returning boolean true or false, both Python and
JavaScript and/or expressions return the result of one of the
arguments, which may of course be a boolean value. The following table
shows how this works, using Python to demonstrate:

Table 2-1. Python's boolean operators

    Operation   Result
    x or y      If x is false, then y, else x
    x and y     If x is false, then x, else y
    not x       If x is false, then True, else False


This fact allows for some occasionally useful variable assignments:

    rocket_launch = True
    (rocket_launch == True and 'All OK') or 'We have a problem!'
    Out:
    'All OK'

    rocket_launch = False
    (rocket_launch == True and 'All OK') or 'We have a problem!'
    Out:
    'We have a problem!'
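One caveat worth sketching: because and/or return one of their operands, the idiom above misfires when the value in the "true" position is itself falsy. The function names below are invented for the illustration, and the conditional expression shown as an alternative is standard Python (2.5 onwards).

```python
def status_and_or(rocket_launch):
    # The and/or idiom from the text.
    return (rocket_launch == True and 'All OK') or 'We have a problem!'

def status_conditional(rocket_launch):
    # Equivalent conditional expression.
    return 'All OK' if rocket_launch else 'We have a problem!'

# The and/or idiom misfires when the "true" branch is itself falsy:
# the empty string is skipped and the fallback is returned instead.
misfire = (True and '') or 'fallback'
safe = '' if True else 'fallback'
```

For simple cases the two spellings agree; the conditional expression is the safer choice when the first value might be empty, zero or None.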

Data containers: dicts, objects, lists, arrays

Roughly speaking, JavaScript objects can be used like Python dicts
and Python lists like JavaScript arrays. Python also has a tuple
container, which functions like an immutable list. Here are some
examples:

    # Python
    d = {'name': 'Groucho', 'occupation': 'Ruler of Freedonia'}
    l = ['Harpo', 'Groucho', 99]
    t = ('an', 'immutable', 'container')

    // JavaScript
    d = {'name': 'Groucho', 'occupation': 'Ruler of Freedonia'}
    l = ['Harpo', 'Groucho', 99]

As shown in Section A of Example 2-3 and Example 2-4, while
Python's dict keys must be quote-marked strings (or hashable types),
JavaScript allows you to omit the quotes if the property is a valid
identifier, i.e. not containing special characters such as spaces,
dashes etc. So in our studentData objects JS implicitly converts the
property name to string form.

The student data declarations look pretty much the same and, in
practice, are used pretty much the same too. The key difference to
note is that while the curly-bracketed containers in the JS
studentData look like Python dicts, they are actually a shorthand
declaration of JS objects, a somewhat different data-container.

In JS data-visualisation we tend to use arrays of objects as the chief 
data-container and here JS objects function much as a Pythonista 
would expect. In fact, as demonstrated in the following code, we get 
the advantage of both dot notation and key-string access, the former 
being preferred where applicable (keys with spaces, dashes etc. 
needing quoted strings): 

    var foo = {bar:3, baz:5};
    foo.bar; // 3
    foo['baz']; // 5, same as Python

It's good to be aware that though they can be used like Python
dictionaries, JavaScript objects are much more than just containers
(aside from primitives like strings and numbers pretty much everything
in JavaScript is an object)¹⁷. But in most dataviz examples you
see, they are used very much like Python dicts.

Here's a little table to convert basic list operations:

17 This makes iterating over their properties a little trickier than
it might be. See here for more details.
Table 2-2. Lists and arrays

    JavaScript array (a)                 Python list (l)
    a.length                             len(l)
    a.push(item)                         l.append(item)
    a.pop()                              l.pop()
    a.shift()                            l.pop(0)
    a.unshift(item)                      l.insert(0, item)
    a.slice(start, end)                  l[start:end]
    a.splice(start, howMany, i1, ...)    l[start:end] = [i1, ...]
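The Python column of the table can be exercised with a short sketch. The list contents are arbitrary, and the comments name the JavaScript call each line corresponds to.

```python
l = ['a', 'b', 'c', 'd']

length = len(l)            # a.length
l.append('e')              # a.push(item)
last = l.pop()             # a.pop()
first = l.pop(0)           # a.shift()
l.insert(0, 'z')           # a.unshift(item)
middle = l[1:3]            # a.slice(1, 3)

# a.splice(1, 2, 'x', 'y') replaces two items starting at index 1;
# in the Python slice the end index is start + howMany.
l[1:1 + 2] = ['x', 'y']
```

Note that splice takes a count of items to remove whereas a Python slice takes an end index, so the two are only equivalent when end equals start plus howMany.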


Functions

Section B of Example 2-3 and Example 2-4 shows a function
declaration. Python uses def to indicate a function:

    def process_student_data(data, pass_threshold=60,
                             merit_threshold=75):
        """ Perform some basic stats on some student data. """
        ...

Whereas JavaScript uses function:

    function processStudentData(data, passThreshold, meritThreshold){
        passThreshold = typeof passThreshold !== 'undefined'? passThreshold: 60;
        meritThreshold = typeof meritThreshold !== 'undefined'? meritThreshold: 75;
        ...
    }

Both have a list of parameters. With JS the function code-block is
indicated by the curly brackets { ... }, with Python the code-block is
defined by a colon and indentation.

JS has an alternative way of defining a function, the function
expression, which you may see in examples:

    var processStudentData = function( ...){

The differences are subtle enough not to worry about now¹⁸. For what
it's worth, I tend to use function expressions pretty much exclusively.

Function parameters are an area where Python's handling is a good
deal more sophisticated than JavaScript's. As you can see in
process_student_data (Section B of Example 2-3), Python allows
default arguments for the parameters. In JavaScript all parameters
not used in the function call are declared undefined. In order to set
a default value for these we have to perform a distinctly hacky
conditional (ternary) expression:

    function processStudentData(data, passThreshold, meritThreshold){
        passThreshold = typeof passThreshold !== 'undefined'? passThreshold: 60;
        ...

The good news for JavaScripters is that the latest version of
JavaScript, based on Ecmascript 6 and coming very soon, allows
Python-like default parameters:

    function processStudentData(data, passThreshold = 60, meritThreshold = 75){
        ...
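On the Python side, default arguments combine with keyword arguments, so a caller can override just one threshold by name. The sketch below uses a simplified, hypothetical helper (process_scores) rather than the book's process_student_data, so that each call returns a value that is easy to inspect.

```python
def process_scores(scores, pass_threshold=60, merit_threshold=75):
    """Hypothetical helper: grade the average of a list of scores."""
    av = sum(scores) / float(len(scores))
    if av > merit_threshold:
        return 'passed with merit'
    elif av > pass_threshold:
        return 'passed'
    return 'failed'

scores = [68, 75, 56, 81]                       # average is 70.0
default_grade = process_scores(scores)           # uses both defaults
strict_grade = process_scores(scores, pass_threshold=72)
lenient_grade = process_scores(scores, merit_threshold=65)
```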


Iterating: for loops and functional alternatives

Section C of Example 2-3 and Example 2-4 shows our first major
departure, demonstrating JavaScript's functional chops.

Python's for loops are simple, intuitive and work on any iterator¹⁹,
for example lists and dicts. One gotcha with dicts is that standard
iteration is by key, not items. For example:

    foo = {'a':3, 'b':2}
    for x in foo:
        print(x)
    # outputs 'a' 'b'

To iterate over the key-value pairs, use the dict's items method like
so:

    for x in foo.items():
        print(x)
    # outputs key-value tuples ('a', 3) ('b', 2)


18 For the curious, there's a nice summation here.

19 See below for generators and pseudo-containers.
You can assign the key/values in the for statement for convenience.
For example:

    for key, value in foo.items():

Because Python's for loop works on anything with the correct iterator
plumbing, you can do cool things like loop over file lines:

    for line in open('data.txt'):
        print(line)
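The iteration patterns above can be sketched together in one runnable snippet. To avoid needing a real data.txt, an io.StringIO object stands in for the open file; that substitution is an assumption of this sketch, not something the text prescribes.

```python
import io

foo = {'a': 3, 'b': 2}

keys = sorted(foo)                       # iterating a dict yields its keys
pairs = sorted(foo.items())              # items() yields (key, value) tuples
total = 0
for key, value in foo.items():           # unpack each pair in the for statement
    total += value

# io.StringIO stands in for open('data.txt') so the sketch needs no file.
fake_file = io.StringIO(u"first line\nsecond line\n")
lines = [line.strip() for line in fake_file]
```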

Coming from Python, JS's for loop is a pretty horrible, unintuitive
thing. Here's an example:

    for(var i in ['a', 'b', 'c']){
        console.log(i)
    }
    // outputs 0, 1, 2

JS's for .. in returns the index of the array's items, not the items
themselves. To compound matters, for the Pythonista, the order of
iteration is not guaranteed, so the indexes could be returned in
non-consecutive order.

Even iterating over an object is trickier than it might be. Unlike
Python's dicts, objects could have inherited properties from the
prototype chain so you should use a hasOwnProperty guard to filter
these out, like so:

    var obj = {a:3, b:2, c:4};
    for (var prop in obj) {
        if( obj.hasOwnProperty(prop) ) {
            console.log("o." + prop + " = " + obj[prop]);
        }
    }
    // out: o.a = 3, o.b = 2, o.c = 4

Shifting between Python and JS for loops is hardly seamless,
demanding you keep on the ball. The good news is that you hardly
need to use JS for-loops these days. In fact, I almost never find the
need. That's because JS has recently acquired some very powerful
first-class functional abilities, which have more expressive power,
less scope for confusion with Python and, once you get used to
them, quickly become indispensable²⁰.

20 This is one area where JS beats Python hands-down and which finds
many of us wishing for similar functionality in Python.
Section C of Example 2-4 demonstrates forEach(), one of the functional
methods available to modern JavaScript arrays²¹. forEach() iterates
over the array's items, sending them in turn to an anonymous callback
function defined in the first argument where they can be processed.
The true expressive power of these functional methods
comes from chaining them (maps, filters etc.) but already we have a
cleaner, more elegant iteration with none of the awkward
book-keeping of old.

The callback function receives the index and the original array as
optional second and third arguments:

    data.forEach(function(currentValue, index){
        //...
    });

Whereas JS arrays have a set of native functional iterator methods
(map, reduce, filter, every, some, reduceRight), objects, in their
guise as pseudo-dictionaries, don't. If you want to iterate over
object key-value pairs then I'd recommend using underscore²², the
most used functional library for JS and almost as ubiquitous as
jQuery. Underscore methods are accessed with the shorthand _, like
this:

    _.each(obj, function(value, key){
        // do something with the data..
    });

This does introduce a library dependency but this type of iteration is
very common in data-visualisation work and underscore has lots of
other goodies. Along with jQuery it has pretty much honorary JS
standard-library status.
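For comparison, Python's nearest counterparts to these array methods are the map and filter built-ins, functools.reduce and, more idiomatically, list comprehensions. The sketch below applies them to one of the score lists from Example 2-3; it is meant as a rough correspondence rather than a like-for-like replacement for chained JS methods.

```python
from functools import reduce

scores = [68, 75, 56, 81]

doubled = list(map(lambda s: s * 2, scores))              # like a.map(...)
passes = list(filter(lambda s: s > 60, scores))           # like a.filter(...)
total = reduce(lambda prev, cur: prev + cur, scores, 0)   # like a.reduce(..., 0)

# List comprehensions are the more idiomatic Python spelling.
doubled_lc = [s * 2 for s in scores]
passes_lc = [s for s in scores if s > 60]

# enumerate supplies the index, much as forEach's callback receives it.
indexed = [(i, s) for i, s in enumerate(scores)]
```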

Conditionals: if, else, elif, switch

Section C of Example 2-3 and Example 2-4 shows Python and
JavaScript conditionals in action. Aside from JavaScript's bracket
fetish, the statements are very similar, the only real difference being
Python's extra elif keyword, a convenient conjunction of else if.

Though much requested, Python does not have the switch statement
found in most high-level languages. JS does, allowing you to do this:

    switch(expression){
        case value1:
            // execute if expression === value1
            break; // optional end expression
        case value2:
            //...
        default:
            // if other matches fail
    }

21 Added with Ecmascript 5 and available on all modern browsers.

22 I use lodash, which is functionally identical.
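A common Python stand-in for a simple switch is a dictionary lookup, with the dict's get method supplying the default case. The grade letters and messages below are invented for the illustration.

```python
def assess(grade):
    # Each key plays the role of a case label; .get supplies the default.
    messages = {
        'A': 'passed with merit',
        'B': 'passed',
        'C': 'failed',
    }
    return messages.get(grade, 'unknown grade')

print(assess('A'))
```

This only covers switches that map values to results; anything with fall-through or more involved branch bodies is usually clearer as an if/elif chain.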

File input and output 

JavaScript has no real equivalent of file input and output (I/O) but
Python's is as simple as could be:

    # READING A FILE
    f = open("data.txt") # open file for reading

    for line in f: # iterate over file-lines
        print(line)

    lines = f.readlines() # grab all lines in file into a list
    data = f.read() # read all of file as single string

    # WRITING TO A FILE
    f = open("data.txt", 'w') # use 'w' to write, 'a' to append to file
    f.write("this will be written as a line to the file")
    f.close() # explicitly close the file

One much recommended best-practice is to use Python's with, as
context manager when opening files. This ensures they are closed
automatically when leaving the block, essentially providing syntactic
sugar for a try, except, finally block. Here's how to open a file using
with, as:

    with open("data.txt") as f:
        lines = f.readlines()
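The sketch below shows the with block closing the file for you and, next to it, roughly the try/finally code it saves you from writing. It creates its own temporary file (the path is generated with tempfile, an assumption made only to keep the example self-contained).

```python
import os
import tempfile

# Write a small temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write("line one\nline two\n")

# The with block closes the file on exit, even if an error is raised.
with open(path) as f:
    lines = f.readlines()
closed_after_with = f.closed

# Roughly what the with block does behind the scenes.
f = open(path)
try:
    data = f.read()
finally:
    f.close()
```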


Classes and prototypes 

Possibly the cause of more confusion than any other topic is
JavaScript's choice of prototypes rather than classical classes as its
chief Object Orientated Programming (OOP) element. I have come to
appreciate the concept of prototypes, if not its JS implementation,
which could have been cleaner. Nevertheless, once you get the basic
principle you may find that it is actually a better mental model for
much of what we do as programmers than classical OOP paradigms.

I remember, when I first started my forays into more advanced
languages like C++, falling for the promise of OOP, particularly
class-based inheritance. Polymorphism was all the rage and Shape classes
were being sub-classed to rectangles and ellipses, which were in turn
subclassed to more specialised squares and circles.

It didn't take long to realise that the clean class divisions found in
the text-books were rarely found in real programming and that trying
to balance generic and specific APIs quickly became fraught. In
this sense, I find composition and mix-ins much more useful as a
programming concept than attempts at extended subclassing and
often avoid all these by using functional programming techniques,
particularly in JavaScript. Nevertheless, the class/prototype
distinction is an obvious difference between the two languages and the
more you understand its nuances the better you'll code²³.

Python's classes are fairly simple affairs and, like most of the
language, easy to use. I tend to think of them these days as a handy way
to encapsulate data with a convenient API and rarely extend
subclassing beyond one generation. Here's a simple example:

    class Citizen(object):

        def __init__(self, name, country): # (1)
            self.name = name
            self.country = country

        def print_details(self):
            print('Citizen %s from %s'%(self.name, self.country))

    c = Citizen('Groucho M.', 'Freedonia') # (2)
    c.print_details()
    Out:
    Citizen Groucho M. from Freedonia

(1) Python classes have a number of double-underscored special
    methods, __init__ being the most common, called when the
    class instance is created. All instance methods have a first,
    explicit self argument (you could name it something else but it's
    a very bad idea) which refers to the instance. In this case we use
    it to set name and country properties.

(2) Creates a new Citizen instance, initialised with name and
    country.

23 I mentioned to a talented programmer friend that I was faced with
the challenge of explaining prototypes to Python programmers and he
pointed out that most JavaScripters could probably do with some
pointers too. There's a lot of truth in this and many JSers do manage
to be productive by using prototypes in a classy way, hacking their way
around the edge-cases.

Python follows a fairly classical pattern of class inheritance. It's easy
to do, which is probably why Pythonistas make a lot of use of it. Let's
customise the Citizen class to create a (Nobel) Winner class with a
couple of extra properties:

    class Winner(Citizen):

        def __init__(self, name, country, category, year):
            super(Winner, self).__init__(name, country) # (1)
            self.category = category
            self.year = year

        def print_details(self):
            print('Nobel winner %s from %s, category %s, year %s'\
                  %(self.name, self.country, self.category, str(self.year)))

    w = Winner('Albert E.', 'Switzerland', 'Physics', 1921)
    w.print_details()
    Out:
    Nobel winner Albert E. from Switzerland, category Physics,
    year 1921

(1) We want to reuse the super-class Citizen's __init__ method,
    using this Winner instance as self. The super method scales the
    inheritance tree one branch from its first argument, supplying
    the second as instance to the class-instance method.
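In Python 3 the same call can be written as plain super(), with no arguments. The sketch below is a variation on the Citizen and Winner classes in which the methods return strings instead of printing (an assumption made here so the result is easy to check), and it also uses super to reuse the parent's method.

```python
class Citizen(object):
    def __init__(self, name, country):
        self.name = name
        self.country = country

    def details(self):
        return 'Citizen %s from %s' % (self.name, self.country)


class Winner(Citizen):
    def __init__(self, name, country, category, year):
        # Python 3 shorthand for super(Winner, self).__init__(...)
        super().__init__(name, country)
        self.category = category
        self.year = year

    def details(self):
        # Reuse the parent's string and extend it.
        return '%s, Nobel winner in %s (%d)' % (
            super().details(), self.category, self.year)


w = Winner('Albert E.', 'Switzerland', 'Physics', 1921)
print(w.details())
```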

I think the best article I have read on the key difference between
JavaScript's prototypes and classical classes is Reginald Braithwaite's
“OOP, JavaScript, and so-called Classes”. This quote sums up the
difference between classes and prototypes as nicely as any I've found:

    “The difference between a prototype and a class is similar to the
    difference between a model home and a blueprint for a home.”

When you instantiate a C++ or Python class, a blueprint is followed,
creating an object and calling its various constructors in the
inheritance tree. In other words you start from scratch and build a nice,
pristine new class instance.
With JavaScript prototypes you start with a model home (object)
which has rooms (methods). If you want a new living room you can
just replace the old one with something in better colors etc. If you
want a new conservatory then just make an extension. But rather
than building from scratch with a blueprint, you're adapting and
extending an existing object.

With that necessary theory out of the way and the reminder that
object-inheritance is useful to know but hardly ubiquitous in
dataviz, let's see a simple JavaScript prototype example, Example 2-5.

Example 2-5. A simple JavaScript object

    var Citizen = function(name, country){ // (1)
        this.name = name; // (2)
        this.country = country;
    };

    Citizen.prototype = { // (3)
        printDetails: function(){
            console.log('Citizen ' + this.name + ' from ' + this.country);
        }
    };

    var c = new Citizen('Groucho M.', 'Freedonia'); // (4)

    c.printDetails();
    Out:
    Citizen Groucho M. from Freedonia

    typeof(c) // object

(1) JavaScript has no classes²⁴ so object-instances are built from
    functions or objects.

(2) this is an implicit reference to the calling context of the
    function. For now it behaves as you would expect but though it
    looks a little like Python's self the two are quite different, as
    we'll see.

(3) The methods specified here will both override any prototypical
    methods up the inheritance chain and be inherited by any
    objects derived from Citizen.

(4) new is used to create a new object, set its prototype to that of
    the Citizen constructor function and then call the Citizen
    function on the new object.

24 As of Ecmascript 6 this will change with the addition of the class
keyword, a piece of syntactic sugar generating a lot of heat and not
much light right now.

self vs this

It would be easy enough, at first glance, to assume that Python's
self and JavaScript's this are essentially the same, the latter being an
implicit version of the former, which is supplied to all class instance
methods. But actually this and self are significantly different. Let's
use our bi-lingual Citizen class to demonstrate.

Python's self is a variable supplied to each class method (you can
call it anything you like but it's not advisable), representing the class
instance. But this is a keyword that refers to the object calling the
method. This calling object can be different from the method's
object instance and JavaScript provides the call, bind and apply
function methods to allow you to exploit this fact.

Let's use the call method to change the calling object of a
printDetails method and therefore the reference for this, used
in the method to get the Citizen's name:

    var groucho = new Citizen('Groucho M.', 'Freedonia');
    var harpo = new Citizen('Harpo M.', 'Freedonia');

    groucho.printDetails.call(harpo);
    Out:
    "Citizen Harpo M. from Freedonia"

So JavaScript's this is a much more malleable proxy than Python's
self, offering more freedom but also the responsibility of tracking
calling context and, should you use it, making sure new is always
used in creating objects²⁵.
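The nearest Python analogue of borrowing a method with call is to invoke the method through the class and pass the instance explicitly, which works precisely because self is an ordinary argument. The sketch below uses a details method that returns a string (an assumption made here, in place of the printing method used earlier).

```python
class Citizen(object):
    def __init__(self, name, country):
        self.name = name
        self.country = country

    def details(self):
        return 'Citizen %s from %s' % (self.name, self.country)

groucho = Citizen('Groucho M.', 'Freedonia')
harpo = Citizen('Harpo M.', 'Freedonia')

# Normal call: Python passes groucho as self.
own = groucho.details()

# Calling through the class lets us choose self explicitly, roughly as
# JavaScript's call method lets us choose this.
borrowed = Citizen.details(harpo)
```

The difference is that in Python the choice of instance is always visible at the call site, whereas in JavaScript the value of this depends on how the function happens to be invoked.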


I included Example 2-5, which shows new in JavaScript object
instantiation, because you will run into its use a fair deal. But the
syntax is already a little awkward and gets quite a bit worse when you
try to do inheritance. Ecmascript 5 introduced the Object.create
method, a better way to create objects and to implement inheritance.
I'd recommend using it in your own code but new will probably crop
up in some third party libraries.

Let's use Object.create to create a Citizen and its Winner inheritor.
To emphasise, JavaScript has many ways to do this but
Example 2-6 shows the cleanest I have found and my personal
pattern.

25 This is another reason to use Ecmascript 5's 'use strict';
injunction, which calls attention to such mistakes.


Example 2-6. Prototypical inheritance with Object.create

    var Citizen = { // ❶
        setCitizen: function(name, country){
            this.name = name;
            this.country = country;
            return this;
        },
        printDetails: function(){
            console.log('Citizen ' + this.name + ' from ' + this.country);
        }
    };

    var Winner = Object.create(Citizen);

    Winner.setWinner = function(name, country, category, year){
        this.setCitizen(name, country);
        this.category = category;
        this.year = year;
        return this;
    };

    Winner.printDetails = function(){
        console.log('Nobel winner ' + this.name + ' from ' +
            this.country + ', category ' + this.category + ', year ' +
            this.year);
    };

    var albert = Object.create(Winner)
        .setWinner('Albert Einstein', 'Switzerland', 'Physics', 1921);

    albert.printDetails();

Out:

    Nobel winner Albert Einstein from Switzerland, category
    Physics, year 1921

❶ Citizen is now an Object rather than a constructor function.
Think of this as the base-house for any new buildings such as
Winner.

To reiterate, prototypical inheritance is not seen that often in
JavaScript data-viz, particularly in the 800-pound gorilla D3, with its
emphasis on declarative and functional patterns, with raw unencapsulated
data being used to stamp its impression on the web-page.

The tricky class/prototype comparison concludes this section on
basic syntactic differences. Now let's look at some common patterns
seen in dataviz work with Python and JS.

Differences in Practice 

The syntactic differences between JS and Python are important to
know and thankfully outweighed by their syntactic similarities. The
meat and potatoes of imperative programming (loops, conditionals,
data declaration and manipulation) is much the same. This is all the
more so in the specialised domain of data-processing and
data-visualisation, where the languages' first-class functions allow
common idioms.

What follows is a less than comprehensive list of some important
patterns and idioms seen in Python and JavaScript, from the
perspective of a data-visualiser. Where possible, a translation between
the two languages is given.

Method chaining 

A common JavaScript idiom is method chaining, popularised by its
most popular library, jQuery, and much used in D3. Method chaining
involves returning an object from its own method in order to
call another method on the result, using dot-notation:

    var sel = d3.select('#viz')
        .attr('width', '600px') // ❶
        .attr('height', '400px')
        .style('background', 'lightgray');

❶ The attr method returns the D3 selection that called it, which
is then used to call another attr method.

Method chaining is not much seen in Python, which generally
advocates one statement per line, in keeping with simplicity and
readability.
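Although it is less common, the same idiom is easy enough to express in Python by having each method return self. The following is only a minimal sketch: Selection is a hypothetical class invented for this illustration (it is not part of any library) and loosely mimics the D3 snippet above.

```python
class Selection(object):
    """Hypothetical stand-in for a D3-style selection (illustration only)."""

    def __init__(self, selector):
        self.selector = selector
        self.attrs = {}
        self.styles = {}

    def attr(self, name, value):
        self.attrs[name] = value
        return self  # returning the instance is what makes chaining possible

    def style(self, name, value):
        self.styles[name] = value
        return self


# Wrapping the chain in parentheses lets it span several lines
sel = (Selection('#viz')
       .attr('width', '600px')
       .attr('height', '400px')
       .style('background', 'lightgray'))
```

The enclosing parentheses are what allow the dot-notation to run over several lines without backslashes.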

Enumerating a list 

Often it's useful to iterate through a list while keeping track of the
item's index. Python has the very handy enumerate built-in function for
just this reason:

    names = ['Alice', 'Bob', 'Carol']

    for i, n in enumerate(names):
        print('%d: %s' % (i, n))

Out:

    0: Alice
    1: Bob
    2: Carol

JavaScript's list methods, such as forEach and the functional map,
reduce and filter, supply the iterated item and its index to the
callback function:

    var names = ['Alice', 'Bob', 'Carol'];

    names.forEach(function(n, i){
        console.log(i + ': ' + n);
    });

Out:

    0: Alice
    1: Bob
    2: Carol

Tuple unpacking

One of the first cool tricks Python initiates come across uses tuple
unpacking to switch variables:

    (a, b) = (b, a)

Note that the brackets are optional. This can be put to more
practical purpose as a way of reducing temporary variables, for
example in a Fibonacci function:

    def fibonacci(n):
        x, y = 0, 1
        for i in range(n):
            print(x)
            x, y = y, x + y

If you want to ignore one of the unpacked variables, use an
underscore:

    winner = 'Albert Einstein', 'Physics', 1921, 'Swiss'
    name, _, _, nationality = winner

Tuple unpacking has a slew of use-cases. It is also a fundamental
feature of the language and not available in JavaScript (at least
before ES6's destructuring assignment).
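As a small illustration of those use-cases, the sketch below collects the Fibonacci numbers into a list rather than printing them and, assuming Python 3, uses a starred underscore to swallow any number of unwanted fields. The function name fib_list is made up for this example.

```python
def fib_list(n):
    """Return the first n Fibonacci numbers as a list."""
    x, y = 0, 1
    result = []
    for _ in range(n):  # the loop index is not needed, so use an underscore
        result.append(x)
        x, y = y, x + y  # update both values at once, no temporary variable
    return result


winner = 'Albert Einstein', 'Physics', 1921, 'Swiss'
# Python 3 only: *_ soaks up all of the middle fields
name, *_, nationality = winner

print(fib_list(8))
print(name, nationality)
```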

Collections 

One of the most useful Python 'batteries' is the collections module.
This provides some specialised container datatypes to augment
Python's standard set. It has a deque, which provides a list-like
container with fast appends and pops at either end, an OrderedDict
which remembers the order entries were added, a defaultdict,
which provides a factory function to set the dictionary's default, and
a Counter container for counting hashable objects, among others. I
find myself using the last three a lot. Here are a few examples:

    from collections import Counter, defaultdict, OrderedDict

    items = ['F', 'C', 'C', 'A', 'B', 'A', 'C', 'E', 'F']

    cntr = Counter(items)
    print(cntr)
    cntr['C'] -= 1
    print(cntr)

Out:

    Counter({'C': 3, 'A': 2, 'F': 2, 'B': 1, 'E': 1})
    Counter({'A': 2, 'C': 2, 'F': 2, 'B': 1, 'E': 1})

    d = defaultdict(int)  # ❶

    for item in items:
        d[item] += 1  # ❷

    d

Out:

    defaultdict(<type 'int'>, {'A': 2, 'C': 3, 'B': 1, 'E': 1, 'F': 2})

    OrderedDict(sorted(d.items(), key=lambda i: i[1]))  # ❸

Out:

    OrderedDict([('B', 1), ('E', 1), ('A', 2), ('F', 2), ('C', 3)])  # ❹

❶ Sets the dictionary default to an integer, value 0 by default.

❷ If the item-key doesn't exist, its value is set to the default of zero
and 1 added to that.

❸ Gets the list of items in the dictionary d as key-value tuple pairs,
sorts using the integer value and then creates an OrderedDict
with the sorted list.

❹ The OrderedDict remembers the (sorted) order of the items as
they were added to it.

You can get more details on the collections module from here.
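The deque mentioned above does not feature in those examples, so here is a minimal sketch of it, using made-up readings as data. The maxlen argument is a standard deque option which turns it into a fixed-size sliding window.

```python
from collections import deque

# A deque supports fast appends and pops at both ends
dq = deque(['B', 'C'])
dq.append('D')         # add to the right-hand end
dq.appendleft('A')     # add to the left-hand end
first = dq.popleft()   # remove from the left
last = dq.pop()        # remove from the right

# With maxlen set, the oldest items fall off the far end automatically
window = deque(maxlen=3)
for reading in [1, 2, 3, 4, 5]:  # made-up data
    window.append(reading)

print(first, last, list(dq), list(window))
```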

There is a recent JavaScript library that emulates the Python
collections module. You can find it here. As of late 2015 it is a very
new but impressive piece of work, worth checking out even if you just
want to extend your JavaScript knowledge.

If you want to replicate some of Python's collections functions using
more conventional JavaScript libraries, underscore or its functionally
identical replacement lodash[26] are a good place to start. These
libraries offer some enhanced functional programming utilities. Let's
take a quick look at these handy tools.

Underscore 

Underscore is probably the most popular JavaScript library after the
ubiquitous jQuery and offers a bevy of functional programming
utilities for the JavaScript dataviz programmer. The easiest way to
use underscore is to use a content delivery network (CDN) to load it
remotely (these loads will be cached by your browser, making things
very efficient for common libraries), like so:

    <script src="https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.8.3/underscore-min.js"></script>

Underscore has loads of useful functions. There is, for example, a
countBy method which serves the same purpose as Python's
collections Counter just discussed:

    var items = ['F', 'C', 'C', 'A', 'B', 'A', 'C', 'E', 'F'];

    _.countBy(items) // ❶

Out:

    Object {F: 2, C: 3, A: 2, B: 1, E: 1}

❶ Now you see why the library is called underscore.

As we'll now see, the inclusion in modern JavaScript of native
functional methods (map, reduce, filter) and a forEach iterator for
arrays has made underscore slightly less indispensable, but it still has
some great utilities to augment vanilla JS. With a little chaining you
can produce extremely terse but very powerful code. Underscore
was my gateway drug to functional programming in JavaScript and
the idioms are just as addictive today. Check out underscore's
repertoire of utilities here.

[26] My personal choice for performance reasons.

Let's have a look at underscore in action, tackling a more involved
task:

    var journeys = [
        {period: 'morning', times: [44, 34, 56, 31]},
        {period: 'evening', times: [35, 33]},
        {period: 'morning', times: [33, 29, 35, 41]},
        {period: 'evening', times: [24, 45, 27]},
        {period: 'morning', times: [18, 23, 28]}
    ];

    var groups = _.groupBy(journeys, 'period');
    var mTimes = _.pluck(groups['morning'], 'times');
    mTimes = _.flatten(mTimes); // ❶
    var average = function(l){
        var sum = _.reduce(l, function(a, b){ return a + b; }, 0);
        return sum / l.length;
    };

    console.log('Average morning time is ' + average(mTimes));

Out:

    Average morning time is 33.81818181818182

❶ Our array of morning-times arrays ([[44, 34, 56, 31], [33...]])
needs to be flattened into a single array of numbers.

Functional array methods and list comprehensions

I find myself using underscore a lot less since the addition, with
Ecmascript 5, of functional methods to JavaScript arrays. I don't
think I've used a conventional for-loop since then which, given the
ugliness of JS for-loops, is a very good thing.

Once you get used to processing arrays functionally, it's hard to
consider going back. Combined with JS's anonymous functions it makes
for very fluid, expressive programming. It's also an area where
method chaining seems very natural. Let's look at a highly contrived
example:

    var nums = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

    var sum = nums.filter(function(o){ return o % 2; }) // ❶
        .map(function(o){ return o * o; }) // ❷
        .reduce(function(a, b){ return a + b; }); // ❸

    console.log('Sum of the odd squares is ' + sum);

❶ Filters the list for odd numbers, i.e. those returning 1 for the
modulus (%) 2 operation.

❷ map produces a new list by applying a function to each member,
i.e. [1, 3, 5...] -> [1, 9, 25...].

❸ reduce processes the resultant mapped list in sequence, providing
the current (in this case summed) value (a) and the item
value (b). With no initial value supplied, as here, the first item of
the list is used as the starting value of the first argument (a).

Python's powerful list comprehensions can emulate the example
above easily enough:

    nums = range(10)  # ❶

    odd_squares = [x * x for x in nums if x % 2]  # ❷
    sum(odd_squares)  # ❸

Out:

    165

❶ Python has a handy built-in range function, which can also take
a start, end and step, e.g. range(2, 8, 2) -> [2, 4, 6].

❷ The if condition tests for oddness of x, and any numbers passing
this filter are squared and inserted into the list.

❸ Python also has a built-in and often-used sum function.


Python's list comprehensions can use nested
control structures, applying a second for/if
expression to the iterated items etc. Although
this can create terse and powerful lines of code, it
goes against the grain of Python's readability and
I would discourage its use. Even simple list
comprehensions are less than intuitive and, as
much as it appeals to the leet hacker in all of us,
you risk creating incomprehensible code.

Python's list comprehensions work well for basic filtering and
mapping. They do lack the convenience of JavaScript's anonymous
functions (which are fully fledged, with their own scope, control blocks,
exception handling etc.) but there are arguments against the use of
anonymous functions. For example, they are not reusable and, being
unnamed, they make it hard to follow exceptions and debug. See
here for some persuasive arguments. Having said that, for libraries
like D3, replacing the small, throw-away anonymous functions used
to set DOM attributes and properties with named ones would be far
too onerous and just add to the boilerplate.

Python does have functional lambda expressions, which we'll look at
in the next section, but for full functional processing, in Python by
necessity and in JavaScript for best practice, we might use named
functions to increase our control scope. For our simple odd-squares
example named functions are a contrivance, but note that they
increase the first-glance readability of the list comprehension,
which becomes much more important as your functions get more complex.

    def is_odd(x):
        return x % 2

    def sq(x):
        return x * x

    sum([sq(x) for x in l if is_odd(x)])

With JavaScript a similar contrivance can also increase readability
and facilitate DRY[27] code:

    var isOdd = function(x){ return x % 2; };
    sum = l.filter(isOdd)


Map, reduce and filter with Python's lambdas 

Although Python lacks anonymous functions it does have lambdas,
nameless expressions which take arguments. While lacking the bells
and whistles of JavaScript's anonymous functions, these are a
powerful addition to Python's functional programming repertoire,
especially when combined with its functional methods.

Python's functional built-ins (the map, reduce and filter
methods and lambda expressions) have a chequered
past. It's no secret that the creator of
Python wanted to remove them from the language.
The clamour of disapproval led to their
reluctant preservation. With the recent trend
towards functional programming this looks like
a very good thing. They're not perfect but far
better than nothing. And given JavaScript's
strong functional emphasis they're a good way
to leverage skills acquired in that language.

Python's lambdas take a number of parameters and return an
operation on them, using a colon separator to define the function block,
in much the same way as standard Python functions, only pared to
the bare essentials and with an implicit return. The following
example shows a few lambdas employed in functional programming:

    from functools import reduce  # if using Python 3+

    nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    odds = filter(lambda x: x % 2, nums)
    odds_sq = map(lambda x: x * x, odds)
    reduce(lambda x, y: x + y, odds_sq)  # ❶

Out:

    165

[27] Don't repeat yourself, a solid coding convention.

❶ Here the reduce method provides two arguments to the lambda,
which uses them to return the expression after the colon.
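One caveat, assuming you are running Python 3: there filter and map return lazy iterators rather than lists. The reduce call still works, because it simply consumes the iterator, but if you want to inspect or reuse the intermediate results you need to wrap them in list. A minimal sketch:

```python
from functools import reduce

nums = list(range(10))

# In Python 3 filter and map are lazy; list() forces them into real lists
odds = list(filter(lambda x: x % 2, nums))
odds_sq = list(map(lambda x: x * x, odds))
total = reduce(lambda x, y: x + y, odds_sq)

print(odds)
print(odds_sq)
print(total)
```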

JavaScript closures and the module-pattern 

One of the key concepts in JavaScript is that of the closure,
essentially a nested function declaration which uses variables declared in
an outer (but not global) scope, which are kept alive after the
function is returned. Closures allow for a number of very useful
programming patterns and are a common feature of the language.

Let's look at possibly the most common usage of closures and one
we've already seen exploited in our module pattern (Example 2-2):
exposing a limited API while having access to essentially private
member variables.

A simple example of a closure is this little counter: 

    function Counter(inc) {
        var count = 0;
        var add = function() { // ❶
            count += inc;
            console.log('Current count: ' + count);
        };
        return add;
    }

    var inc2 = Counter(2); // ❷
    inc2(); // ❸

Out:

    Current count: 2

    inc2();

Out:

    Current count: 4

❶ The add function gets access to the essentially private,
outer-scope count and inc variables.

❷ This returns an add function with the closure-variables, count
(0) and inc (2).

❸ Calling inc2 calls add, updating the closed count variable.

We can extend the Counter to add a little API. This technique is the
basis of JavaScript modules and many simple libraries. In essence it
selectively exposes public methods while hiding private methods and
variables, generally seen as good practice in the programming
world:

    function Counter(inc) {
        var count = 0;
        var api = {};
        api.add = function() {
            count += inc;
            console.log('Current count: ' + count);
        };
        api.sub = function() {
            count -= inc;
            console.log('Current count: ' + count);
        };
        api.reset = function() {
            count = 0;
            console.log('Count reset to 0');
        };
        return api;
    }

    var cntr = Counter(3);
    cntr.add();   // Current count: 3
    cntr.add();   // Current count: 6
    cntr.sub();   // Current count: 3
    cntr.reset(); // Count reset to 0

Closures have all sorts of uses in JavaScript and I'd recommend
getting your head around them; you'll see them a lot as you start
investigating other people's code. These are three particularly good
web-articles that provide a lot of good use-cases for closures: 1, 2, 3.

Python has closures but they are not used nearly as much as
JavaScript's, perhaps because of a few quirks which, though
surmountable, make for some slightly awkward code. To demonstrate,
Example 2-7 tries to replicate the previous JavaScript Counter.


Example 2-7. A first-pass attempt at a Python counter closure

    def get_counter(inc):
        count = 0
        def add():
            count += inc
            print('Current count: ' + str(count))
        return add

If you create a counter with get_counter (Example 2-7) and try to
run it you'll get an UnboundLocalError:

    cntr = get_counter(2)
    cntr()

Out:

    UnboundLocalError: local variable 'count' referenced before
    assignment

Interestingly, although we can read the value of count within the add 
function (comment out the count += inc line to try it), attempts to 
change it throw an error. This is because attempts to assign a value 
to something in Python assume it is local in scope. There is no 
count local to the add function and so an error is thrown. 

In Python 3 we can get around the error in Example 2-7 by using
the nonlocal keyword to tell Python that count is in a non-local
scope:

    def add():
        nonlocal count
        count += inc
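Put together, a Python 3 version of the counter might look like the sketch below. It returns the running total as well as printing it, a small departure from Example 2-7 made only so that the result is easy to check.

```python
def get_counter(inc):
    count = 0

    def add():
        nonlocal count  # rebind the enclosing function's variable (Python 3 only)
        count += inc
        print('Current count: ' + str(count))
        return count

    return add


inc2 = get_counter(2)
first = inc2()
second = inc2()

# Each counter keeps its own, independent closed-over state
inc5 = get_counter(5)
third = inc5()
```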

In Python 2 we can use a little dictionary hack to allow mutation of
our closed variables:

    def get_counter(inc):
        vars = {'count': 0}
        def add():
            vars['count'] += inc
            print('Current count: ' + str(vars['count']))
        return add

This hack works because we are not assigning a new value to vars
but mutating an existing container, perfectly valid even if it is out of
local scope.

As you can see, with a bit of effort, JavaScripters can transfer their
closure skills to Python. The use-cases are similar but Python, being
a richer language with lots of useful batteries included, has more
options to apply to the same problem. Probably the most common
use of closures is in Python's decorators.

Decorators are essentially function wrappers that extend the
function's utility without having to alter the function itself. They're a
relatively advanced concept but you can find a user-friendly
introduction here.
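To give a flavour, here is a minimal sketch of a decorator built on a closure. The names count_calls and square are invented for illustration; functools.wraps is a standard-library helper that preserves the wrapped function's name and docstring.

```python
from functools import wraps


def count_calls(func):
    """Decorator that records how many times func has been called."""
    calls = {'n': 0}  # mutable container, so this works in Python 2 and 3

    @wraps(func)
    def wrapper(*args, **kwargs):
        calls['n'] += 1
        wrapper.call_count = calls['n']  # expose the tally on the wrapper
        return func(*args, **kwargs)

    wrapper.call_count = 0
    return wrapper


@count_calls
def square(x):
    return x * x


results = [square(n) for n in range(4)]
print(results, square.call_count)
```

The function square itself is untouched; all of the counting lives in the wrapper and the closed-over calls dictionary.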
This is that

One JavaScript hack you'll see a lot of is a consequence of closures
and the slippery this keyword. If you wish to refer to the
outer-scoped this in a child function then you must use a proxy, as the
child's this will be bound according to context. The convention is to
use that to refer to this. The code is less confusing than the
explanation:

    function outer(bar){
        this.bar = bar;
        var that = this;
        function inner(baz){
            this.baz = baz * that.bar; // ❶
            // ...
        }
    }

❶ that refers to the outer function's this.

This concludes my cherry-picked selection of patterns and hacks I
find myself using a lot in dataviz work. You'll doubtless acquire
your own but I hope these give you a leg up.

A Cheatsheet 

As a handy reference guide, here's a set of cheat-sheets to translate
basic operations between Python and JavaScript.

JavaScript:

    <script src="lib/vizUtils.js"></script>

    (function(foolib){
        ... // module pattern
    }(window.foolib = window.foolib || {}));

    var foo; // undefined variables
    var bar = 20;

    var foo = function(a, b){
        // clunky defaults, fixed in ES6!
        b = typeof b !== 'undefined' ? b : 10;
        var x = a % b;
        return result;
    };

Python:

    import vizutils as viz
    from vizutils import gblur

    bar = 20

    def foo(a, b=10):
        x = a % b  # significant whitespace!
        return result

Figure 2-2. Some basic syntax

JavaScript:

    var x = false;
    var y = true;
    var l = [];
    if(!x && y === x){...
    if(l.length === 0){...

Python:

    x = False
    y = True
    l = []
    if not x and y == x: ...
    if not l: ...

Figure 2-3. Booleans

JavaScript:

    // camelCase variable names
    var studentData = [
        {name: 'Bob', scores: [68, 75, 56, 81]},
        {name: 'Alice', scores: [75, 90, 64, 88]},
        ...];

    // anonymous functions and first-class functional methods
    studentData.forEach(function(sdata){
        var av = sdata.scores
            .reduce(function(prev, current){
                return prev + current;
            }, 0) / sdata.scores.length;
        sdata.average = av;
        console.log(sdata.name + " scored " + sdata.average);
    });

    while(i < 10){
        ...
    }

    do {
        ...
    }
    while(i < 10);

Python:

    # underscored variable names
    student_data = [
        {'name': 'Bob', 'scores': [68, 75, 56, 81]},
        {'name': 'Alice', 'scores': [75, 90, 64, 88]},
        ...]

    for sdata in student_data:
        # a trailing backslash gives a line-break
        av = sum(sdata['scores'])\
            /float(len(sdata['scores']))
        sdata['average'] = av
        print("%s scored %d" % (sdata['name'], sdata['average']))

    while i < 10:
        ...

    while True:
        ...
        if i >= 10:
            break

Figure 2-4. Loops and iterations

JavaScript:

    if(x === 'foo'){
        ...}
    else if(x === 'bar'){
        ...}
    else{
        ...}

    if(x === foo && y !== bar){...

    if(['foo', 'bar', 'baz']
        .indexOf(s) != -1){...

    switch(foo){
        case bar:
            ...
            break;
        case baz: ...
        default:
            return false;
    }

Python:

    if x == 'foo':
        ...
    elif x == 'bar':
        ...
    else:
        ...

    if x == foo and y != bar: ...

    if s in ['foo', 'bar', 'baz']: ...

    (no equivalent shown for switch)

Figure 2-5. Conditionals


JavaScript:

    var l = [1, 2, 3, 4];
    l.push('foo');    // [...4, 'foo']
    l.pop();          // 'foo', l = [..., 4]
    l.slice(1, 3)     // [2, 3]
    l.slice(-3, -1)   // [2, 3]

    l.map(function(o){ return o*o; })
                      // [1, 4, 9, 16]

    d = {a: 1, b: 2, c: 3};
    d.a == d['a']     // 1
    d.z               // undefined

    // OLD BROWSERS
    for(key in d){
        if(d.hasOwnProperty(key)){
            var item = d[key];
            ...

    // NEW AND BETTER
    Object.keys(d).forEach(function(key, i){
        var item = d[key];
        ...

Python:

    l = [1, 2, 3, 4]
    l.append('foo')   # [...4, 'foo']
    l.pop()           # 'foo', l=[..., 4]
    l[1:3]            # [2, 3]
    l[-3:-1]          # [2, 3]
    l[0:4:2]          # [1, 3] (stride of 2)

    [o*o for o in l]  # [1, 4, 9, 16]

    d = {'a': 1, 'b': 2, 'c': 3}
    d['a']            # 1
    d.get('z')        # NoneType
    d['z']            # KeyError!

    for key, value in d.items(): ...
    for key in d: ...
    for value in d.values(): ...

Figure 2-6. Containers

JavaScript:

    var Foo = {
        initFoo: function(bar){
            this.bar = bar;
            return this;
        }
    };

    var Baz = Object.create(Foo);
    Baz.initBaz = function(bar, qux){
        this.initFoo(bar);
        this.qux = qux;
        return this;
    };

    var baz = Object.create(Baz)
        .initBaz('answer', 42);

Python:

    class Foo(object):
        def __init__(self, bar):
            self.bar = bar

    class Baz(Foo):
        def __init__(self, bar, qux):
            super(Baz, self).__init__(bar)
            self.qux = qux

    baz = Baz('answer', 42)
    baz.bar  # 'answer'

Figure 2-7. Classes and prototypes


Summary 

I hope this chapter has shown that JavaScript and Python have a lot
of common syntax and that most common idioms and patterns
from one of the languages can be expressed in the other without too
much fuss. The meat and potatoes of programming (iteration,
conditionals, basic data manipulation etc.) is simple in both languages
and translation of functions straightforward. If you can program in
one to any degree of competency, the threshold to entry for the
other is low. That's the huge appeal of these simple scripting
languages, which have a lot of common heritage.

I provided a list of patterns, hacks and idioms I find myself using a lot in
dataviz work. I'm sure this list has its idiosyncrasies but I've tried to
tick the obvious boxes.

Treat this as part-tutorial, part-reference for the chapters to come. 
Anything not covered here will be dealt with when introduced. 


76 | Chapter 2: A Language-Learning Bridge Between Python and JavaScript




CHAPTER 3

Reading and Writing Data with Python


One of the fundamental skills of any data-visualiser is the ability to
move data around. Whether your data is in an SQL database, a
comma-separated-value (CSV) file or in some more esoteric form,
you should be comfortable reading the data, converting it and
writing it into a more convenient form if need be. One of
Python's great strengths is how easy it makes manipulating data in
this way, and the focus of this chapter is to bring you up to speed
with this essential aspect of our dataviz toolchain.

This chapter is part tutorial, part reference, and sections of it will be
referred back to in later chapters. If you know the fundamentals of
reading and writing Python data you can cherry-pick parts of the
chapter as a refresher.

Easy Does It 

I remember when I started programming back in the day (using
low-level languages like C) how awkward data manipulation was.
Reading from and writing to files was an annoying mixture of
boilerplate code, hand-rolled kludges and the like. Reading from
databases was equally difficult and, as for serialising data, the
memories are still painful. Discovering Python was a breath of fresh air. It
wasn't a speed demon but opening a file was pretty much as simple
as it could be:

    file = open('data.txt')

Back then Python made reading from and writing to files
refreshingly easy and its sophisticated string-processing made parsing the
data in those files just as easy. It even had an amazing module called
Pickle that could serialise pretty much any Python object.
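As a quick sketch of that kind of serialisation, the snippet below round-trips a small dictionary through pickle in memory. The record itself is made up for illustration, and bear in mind that unpickling data from an untrusted source is unsafe.

```python
import pickle

record = {'name': 'Marie Curie', 'years': [1903, 1911], 'alive': False}

# dumps serialises the object to a byte-string...
blob = pickle.dumps(record)
# ...and loads rebuilds an equal, but separate, object from it
restored = pickle.loads(blob)

print(restored == record, restored is record)
```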

In the years since, Python has added robust, mature modules to its
standard library which make dealing with CSV and JSON files, the
standard for web data-viz work, just as easy. There are also some
great libraries for interacting with SQL databases, such as
SQLAlchemy, my thoroughly recommended go-to. The newer NoSQL
databases are also well served. MongoDB is by some way the most
popular of these newer document-based databases and Python's
pymongo library, demonstrated later in the chapter, makes
interacting with it a relative breeze.

Passing Data Around 

A good way to demonstrate how to use the key data-storage libraries 
is to pass a single data packet among them, reading and writing it as 
we go. This will give us an opportunity to see in action the key data 
formats and databases employed by data-visualisers. 

The data we'll be passing around is probably the most commonly
used in web-visualisations, a list of dictionary-like objects (see
Example 3-1). This data-set would be transferred to the browser in
JSON form, which is, as we'll see, easily converted from a Python
dictionary.


Example 3-1. Our target list of data objects

    nobel_winners = [
        {'category': 'Physics',
         'name': 'Albert Einstein',
         'nationality': 'Swiss',
         'sex': 'male',
         'year': 1921},
        {'category': 'Physics',
         'name': 'Paul Dirac',
         'nationality': 'British',
         'sex': 'male',
         'year': 1933},
        {'category': 'Chemistry',
         'name': 'Marie Curie',
         'nationality': 'Polish',
         'sex': 'female',
         'year': 1911}
    ]

We'll start by creating a CSV file from the Python list shown in
Example 3-1, as a demonstration of reading (opening) and writing
system files.

The following sections assume you're in a work directory with a
data sub-directory to hand. You can run the code from a Python
interpreter or file.

Working with System Files 

In this section we'll create a CSV file from a Python list of
dictionaries (Example 3-1). Usually you'd do this using the csv module,
which we'll demonstrate after this section, so this is just a way of
demonstrating basic Python file-manipulation.

First let's open a new file, using w as a second argument to indicate
we'll be writing data to it.

    f = open('data/nobel_winners.csv', 'w')

Now we'll create our CSV file from the nobel_winners list of
dictionaries (Example 3-1):

    cols = nobel_winners[0].keys()  # ❶
    cols.sort()  # ❷

    with open('data/nobel_winners.csv', 'w') as f:  # ❸
        f.write(','.join(cols) + '\n')  # ❹

        for o in nobel_winners:
            row = [str(o[col]) for col in cols]  # ❺
            f.write(','.join(row) + '\n')

❶ Gets our data columns from the keys of the first object, i.e.
"category, name, ...".

❷ Sorts the columns in alphabetical order.

❸ Uses Python's with statement to guarantee the file is closed on
leaving the block or if any exceptions occur.

❹ join creates a concatenated string from a list of strings (cols
here), joined by the initial string, i.e. "category,name,...".

❺ Creates a list using the column keys to the objects in
nobel_winners.

Now we've created our CSV file, let's use Python to read it and
make sure everything is correct:

    with open('data/nobel_winners.csv') as f:
        for line in f.readlines():
            print(line),  # ❶

Out:

    category,name,nationality,sex,year
    Physics,Albert Einstein,Swiss,male,1921
    Physics,Paul Dirac,British,male,1933
    Chemistry,Marie Curie,Polish,female,1911

❶ Adding a , to the print statement inhibits the addition of an
unnecessary new-line.
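Under Python 3 the trailing-comma trick no longer applies, since print is a function; its end argument does the same job. The sketch below, which assumes Python 3 and writes to a temporary directory rather than data/, also uses sorted to order the columns and iterates over the file object directly instead of calling readlines. Only two of the winners are included, to keep it short.

```python
import os
import tempfile

nobel_winners = [
    {'category': 'Physics', 'name': 'Albert Einstein',
     'nationality': 'Swiss', 'sex': 'male', 'year': 1921},
    {'category': 'Physics', 'name': 'Paul Dirac',
     'nationality': 'British', 'sex': 'male', 'year': 1933},
]

cols = sorted(nobel_winners[0].keys())
path = os.path.join(tempfile.mkdtemp(), 'nobel_winners.csv')

with open(path, 'w') as f:
    f.write(','.join(cols) + '\n')
    for o in nobel_winners:
        f.write(','.join(str(o[col]) for col in cols) + '\n')

lines = []
with open(path) as f:
    for line in f:            # a file object is its own line iterator
        print(line, end='')   # end='' suppresses the extra new-line
        lines.append(line.rstrip('\n'))
```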

As the previous output shows, our CSV file is well-formed. Let's use
Python's built-in csv module to first read it and then create a CSV
file the right way.

CSV, TSV and Row-column Data-formats 

Comma-separated values (CSV) or their tab-separated cousins
(TSV) are probably the most ubiquitous file-based data-formats and,
as a data-visualiser, these will often be the forms you'll receive to work
your magic with. Being able to read and write CSV files and their
various quirky variants, such as pipe or semi-colon separated or
those using ' in place of the standard double-quotes, is a
fundamental skill and Python's csv module is capable of doing pretty much all
your heavy-lifting here. Let's put it through its paces reading and
writing our nobel_winners data:

    nobel_winners = [
        {'category': 'Physics',
         'name': 'Albert Einstein',
         'nationality': 'Swiss',
         'sex': 'male',
         'year': 1921},
    ]

Writing our nobel_winners data (see Example 3-1) to a CSV file is a
pretty simple affair. csv has a dedicated DictWriter class which will
turn our dictionaries into CSV rows. The only piece of explicit
book-keeping we have to do is write a header to our CSV file, using the
keys of our dictionaries as fields (i.e. "category, name, nationality,
sex"):

    import csv

    with open('data/nobel_winners.csv', 'wb') as f:
        fieldnames = nobel_winners[0].keys()  # ❶
        fieldnames.sort()  # ❷
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()  # ❸
        for w in nobel_winners:
            writer.writerow(w)

❶ You need to explicitly tell the writer what fieldnames (in this
case the keys of our dictionaries) to use.

❷ We'll sort the CSV header-fields alphabetically for readability.

❸ Writes the CSV-file header ("category,name,...").
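The listing above targets Python 2. Assuming Python 3, two details change: the file should be opened in text mode with newline='' (as the csv documentation recommends) rather than 'wb', and dictionary keys are a view without a sort method, so sorted is used instead. A sketch along those lines, writing to an in-memory buffer so that it runs without touching the data directory:

```python
import csv
import io

nobel_winners = [
    {'category': 'Physics', 'name': 'Albert Einstein',
     'nationality': 'Swiss', 'sex': 'male', 'year': 1921},
    {'category': 'Chemistry', 'name': 'Marie Curie',
     'nationality': 'Polish', 'sex': 'female', 'year': 1911},
]

# For a real file use: open('data/nobel_winners.csv', 'w', newline='')
buf = io.StringIO(newline='')
fieldnames = sorted(nobel_winners[0].keys())
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(nobel_winners)  # writes every dictionary in one call

csv_text = buf.getvalue()
print(csv_text)
```

Here writerows stands in for the explicit loop over writerow; both are part of the DictWriter interface.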

You'll probably be reading CSV files more often than writing them.[1]
Let's read back the nobel_winners.csv file we just wrote.

If you just want to use csv as a superior and eminently adaptable file
line-reader, a couple of lines will produce a handy iterator, which
can deliver your CSV rows as lists of strings:

    import csv

    with open('data/nobel_winners.csv') as f:
        reader = csv.reader(f)
        for row in reader:  # ❶
            print(row)

Out:

    ['category', 'name', 'nationality', 'sex', 'year']
    ['Physics', 'Albert Einstein', 'Swiss', 'male', '1921']
    ['Physics', 'Paul Dirac', 'British', 'male', '1933']
    ['Chemistry', 'Marie Curie', 'Polish', 'female', '1911']

❶ Iterates over the reader object, consuming the lines in the file.

Note that the numbers are read in string-form. If you want to
manipulate them naturally you'll need to convert any numeric
columns to their respective type, in this case integer years.

[1] I recommend using JSON over CSV as your preferred data-format.

Usually a more convenient way to consume CSV data is to convert
the rows into Python dictionaries. This record form is also the one
we are using as our conversion target (a list of dicts). csv has a
handy DictReader for just this purpose:

import csv

with open('data/nobel_winners.csv') as f:
    reader = csv.DictReader(f)
    nobel_winners = list(reader)  # <1>

nobel_winners

Out:
[{'category': 'Physics', 'nationality': 'Swiss', 'year': '1921',
  'name': 'Albert Einstein', 'sex': 'male'},
 {'category': 'Physics', 'nationality': 'British', 'year': '1933',
  'name': 'Paul Dirac', 'sex': 'male'},
 {'category': 'Chemistry', 'nationality': 'Polish', 'year': '1911',
  'name': 'Marie Curie', 'sex': 'female'}]

<1> Inserts all of the reader items into a list.

As the output shows, we just need to cast the dicts' year attributes
to integers to conform nobel_winners to the chapter's target data
(Example 3-1), thus:

for w in nobel_winners:
    w['year'] = int(w['year'])

The csv readers don't infer data-types from your file, interpreting
everything as a string. Pandas, Python's pre-eminent data-hacking
library, will try and guess the correct type of the data columns,
usually successfully. We'll see this in action in the later dedicated
Pandas chapters.

csv has a few useful arguments to help parse members of the CSV
family:

• dialect: by default excel, specifies a set of dialect-specific
parameters. excel-tab is a sometimes used alternative.

• delimiter: usually files are comma-separated but they could use
|, : or ' ' instead.

• quotechar: by default double-quotes are used but you occasionally
find | or ` instead.
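As a rough illustration of these arguments, here is a minimal sketch in Python 3 that reads in-memory text rather than a file. The pipe-delimited, single-quoted sample data is invented for the example and is not part of the Nobel dataset files.

```python
import csv
import io

# Hypothetical pipe-delimited data: a field that contains the
# delimiter is wrapped in single quotes instead of double quotes.
raw = (
    "category|name|year\n"
    "Physics|'Dirac| Paul'|1933\n"
    "Chemistry|Marie Curie|1911\n"
)

# delimiter and quotechar override the defaults of the excel dialect.
reader = csv.reader(io.StringIO(raw), delimiter='|', quotechar="'")
rows = list(reader)
for row in rows:
    print(row)

# The excel-tab dialect is the excel dialect with a tab delimiter.
tab_rows = list(csv.reader(io.StringIO("Physics\tPaul Dirac\t1933\n"),
                           dialect='excel-tab'))
print(tab_rows)
```

The same keyword arguments can be passed to DictReader and DictWriter.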


82 I Chapter 3: Reading and Writing Data with Python 




You can find the full set of csv parameters here. 

Now we've successfully written and read our target data using the
csv module, let's pass on our CSV-derived nobel_winners dicts to
the json module.

JSON 

In this section we'll write and read our nobel_winners data using
Python's json module. Let's remind ourselves of the data we're using:

nobel_winners = [
    {'category': 'Physics',
     'name': 'Albert Einstein',
     'nationality': 'Swiss',
     'sex': 'male',
     'year': 1921},
]

For data-primitives such as strings, integers, floats etc., Python
dictionaries are easily saved (or dumped in the JSON vernacular) into
JSON files, using the json module. The dump method takes a Python
container and a file-pointer, saving the former to the latter:

import json

with open('data/nobel_winners.json', 'w') as f:
    json.dump(nobel_winners, f)

open('data/nobel_winners.json').read()

Out: '[{"category": "Physics", "name": "Albert Einstein",
"sex": "male", "person_data": {"date of birth": "14th March
1879", "date of death": "18th April 1955"}, "year": 1921,
"nationality": "Swiss"}, {"category": "Physics",
"nationality": "British", "year": 1933, "name": "Paul Dirac",
"sex": "male"}, {"category": "Chemistry", "nationality":
"Polish", "year": 1911, "name": "Marie Curie", "sex":
"female"}]'

Reading (or loading) a JSON file is just as easy. We just pass the
opened JSON file to the json module's load method:

import json

with open('data/nobel_winners.json') as f:
    nobel_winners = json.load(f)

nobel_winners
Out:
[{u'category': u'Physics',
  u'name': u'Albert Einstein',
  u'nationality': u'Swiss',
  u'sex': u'male',
  u'year': 1921},  <1>
 ...]

<1> Note that, unlike our CSV-file conversion, the integer type of
the year column is preserved.

json has loads and dumps counterparts to the file-access methods,
which load and dump JSON strings respectively.
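A minimal sketch of that string round trip, assuming Python 3 and using a single winner dictionary rather than the full list:

```python
import json

winner = {'category': 'Physics', 'name': 'Albert Einstein', 'year': 1921}

# dumps: Python container -> JSON string (no file involved)
json_str = json.dumps(winner)

# loads: JSON string -> Python container
restored = json.loads(json_str)

print(json_str)
print(restored['year'] + 1)  # the year is still an integer
```

These string versions are handy when the JSON is headed for, or arriving from, a web request rather than a file.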

Dealing with dates and times 

Trying to dump a datetime object to JSON produces a TypeError:

from datetime import datetime

json.dumps(datetime.now())

Out:
...
TypeError: datetime.datetime(2015, 9, 13, 10, 25, 52, 586792)
is not JSON serializable

When serializing simple data-types such as strings or numbers, the
default json encoders and decoders are fine. But for more specialised
data such as dates you will need to do your own encoding and
decoding. This isn't as hard as it sounds and quickly becomes routine.
Let's first look at encoding your Python datetimes into sensible
JSON strings.

The easiest way to encode Python data containing datetimes is to
create a custom encoder like the one shown in Example 3-2, which is
provided to the json.dumps method as a cls argument. This encoder
is applied to each object in your data in turn and converts any dates
or date-times to their ISO-format string (see "Dealing with Dates,
Times and Complex Data" on page 102).

Example 3-2. Encoding a Python datetime to JSON 

import datetime
from dateutil import parser
import json

class JSONDateTimeEncoder(json.JSONEncoder):  # <1>
    def default(self, obj):
        if isinstance(obj, (datetime.date, datetime.datetime)):  # <2>
            return obj.isoformat()
        else:
            return json.JSONEncoder.default(self, obj)

def dumps(obj):
    return json.dumps(obj, cls=JSONDateTimeEncoder)  # <3>

<1> Subclasses a JSONEncoder in order to create a customised
date-handling one.

<2> Tests for a datetime object and if true returns the isoformat of
any dates or datetimes, e.g. 2015-09-13T10:25:52.586792

<3> Uses the cls argument to set our custom date-encoder.

Let's see how our new dumps method copes with some datetime data:

now_str = dumps({'time': datetime.datetime.now()})
now_str
Out:
'{"time": "2015-09-13T10:25:52.586792"}'

The time field is correctly converted into an ISO-format string, 
ready to be decoded into a JavaScript Date object (see “Dealing with 
Dates, Times and Complex Data” on page 102 for a demonstration). 

While you could write a generic decoder to cope with date-strings
in arbitrary JSON files^2, it's probably not advisable. Date-strings
come in so many weird and wonderful varieties that this is a job best
done by hand on what is pretty much always a known data-set.

The venerable strptime method, part of the datetime.datetime
class, is good for the job of turning a time-string in a known format
into a Python datetime instance:

In [0]: time_str = '2012/01/01 12:32:11'

In [1]: dt = datetime.strptime(time_str, '%Y/%m/%d %H:%M:%S')  # <1>

In [2]: dt
Out[2]: datetime.datetime(2012, 1, 1, 12, 32, 11)

<1> strptime tries to match the time-string to a format string using
various directives such as %Y (year with century) and %H (hour
as a zero-padded decimal number). If successful it creates a
Python datetime instance. See here for a full list of the directives
available.

2 The Python module dateutil has a parser that will parse most dates and times sensibly
and might be a good basis for this.

If strptime is fed a time-string that does not match its format it
throws a handy ValueError:

dt = datetime.strptime('1/2/2012 12:32:11', '%Y/%m/%d %H:%M:%S')

ValueError                        Traceback (most recent call last)
<ipython-input-111-af657749a9fe> in <module>()
----> 1 dt = datetime.strptime('1/2/2012 12:32:11', '%Y/%m/%d %H:%M:%S')

ValueError: time data '1/2/2012 12:32:11' does not match
format '%Y/%m/%d %H:%M:%S'

So to convert date fields of a known format into datetimes for a
data list of dictionaries you could do something like this:

for d in data:
    try:
        d['date'] = datetime.strptime(d['date'], '%Y/%m/%d %H:%M:%S')
    except ValueError:
        print('Oops! - invalid date for ' + repr(d))
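Putting the encoding and decoding halves together, the following is a minimal round-trip sketch, assuming Python 3. The record, the DateTimeEncoder name and the ISO_FORMAT constant are invented for the illustration. Note that isoformat() omits the fractional part when the microseconds are zero, so a real decoder would need to allow for both forms.

```python
import json
from datetime import datetime

# Matches datetime.isoformat() output when microseconds are non-zero.
ISO_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'


class DateTimeEncoder(json.JSONEncoder):
    """Encodes datetimes as ISO-format strings, defers everything else."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return json.JSONEncoder.default(self, obj)


# A hypothetical record with one datetime field.
record = {'name': 'example',
          'time': datetime(2015, 9, 13, 10, 25, 52, 586792)}

encoded = json.dumps(record, cls=DateTimeEncoder)

# Decoding is done by hand on the field we know holds a date.
decoded = json.loads(encoded)
decoded['time'] = datetime.strptime(decoded['time'], ISO_FORMAT)

print(encoded)
print(decoded['time'])
```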

Now that we've dealt with the two most popular data file-formats,
let's shift to the big guns and see how to read our data from and write
our data to SQL and NoSQL databases.

SQL 

For interacting with an SQL database, sqlalchemy is by some way
the most popular and, in my opinion, best Python library. It allows
you to use raw SQL instructions if speed and efficiency are an issue
but also provides a powerful object relational mapping (ORM)
which allows you to operate on SQL tables using a high-level,
Pythonic API, treating them essentially as Python classes.

Reading and writing data using SQL while allowing the user to treat
that data as a Python container is a complicated process and while
sqlalchemy is far more user-friendly than using a low-level SQL
engine, it is still a fairly complex library. I'll be covering the basics
here, using our data as a target, but would encourage you to spend a
little time reading some of the rather excellent documentation here.
Let's remind ourselves of the nobel_winners data-set we're aiming to
write and read:

nobel_winners = [
    {'category': 'Physics',
     'name': 'Albert Einstein',
     'nationality': 'Swiss',
     'sex': 'male',
     'year': 1921},
]

Let's first write our target data to an SQLite file using SQLAlchemy,
starting by creating the database engine.

Creating the database engine 

The first thing you need to do when starting an sqlalchemy session
is to create a database engine. This engine will establish a connection
with the database in question and perform any conversions
needed to the generic SQL instructions being generated by
sqlalchemy and the data being returned.

There are engines for pretty much every popular database as well as
a memory option, which holds the database in RAM, allowing fast
access for testing etc.^3 The great thing about these engines is that
they are interchangeable, which means you could develop your code
using the convenient file-based SQLite database and then switch in
production to something a little more industrial, say Postgresql, by
changing a single config-string. Check here for the full list of
engines available.

The form for specifying a database URL is

dialect+driver://username:password@host:port/database
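To make the parts of that form concrete, here is a small sketch that pulls a URL apart with Python 3's standard urllib.parse. It is only an illustration of the string layout: SQLAlchemy does its own URL parsing, and the user, password, port and database name shown are invented.

```python
from urllib.parse import urlsplit

# A hypothetical URL following dialect+driver://username:password@host:port/database
url = 'postgresql+psycopg2://kyran:mypsswd@localhost:5432/nobel_prize'

parts = urlsplit(url)

# The scheme holds the dialect and the (optional) driver, joined by '+'.
dialect, _, driver = parts.scheme.partition('+')
database = parts.path.lstrip('/')

print(dialect, driver)
print(parts.username, parts.password)
print(parts.hostname, parts.port)
print(database)
```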

So, to connect to a nobel_prize MySQL database running on localhost
would require something like this. Note that create_engine does not
actually make any SQL requests at this point, merely sets up the
framework for doing so^4.

engine = create_engine('mysql://kyran:mypsswd@localhost/nobel_prize')

3 On a cautionary note, it is probably a bad idea to use different database configurations
for testing and production.

We'll use a file-based SQLite database, setting the echo argument to
True, which will output any SQL instructions generated by SQLAlchemy.
Note the use of three forward slashes after the colon:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///data/nobel_prize.db', echo=True)

SQLAlchemy offers various ways to engage with databases but I
would recommend using the more recent declarative style unless
there are good reasons to go with something more low-level and
fine-grained. In essence, with declarative mapping you sub-class
your Python SQL-table classes from a base and SQLAlchemy
introspects their structure and relationships. See here for more details.

Defining the database tables 

We first create a Base class using declarative_base. This base will
be used to create table-classes, from which SQLAlchemy will create
the database's table schemas. You can use these table-classes to
interact with the database in a fairly Pythonic fashion. Note that most
SQL libraries require you to formally define table-schemas. This is
in contrast to such schemaless NoSQL variants as MongoDB. We'll
take a look at the Dataset library later in this chapter, which enables
schemaless SQL.

Using this Base we define our various tables, in our case a single
Winner table. Example 3-3 shows how to subclass Base and use
sqlalchemy's datatypes to define a table-schema. Note the
__tablename__ member, which will be used to name the SQL table
and as a keyword to retrieve it, and the optional custom __repr__
method, which will be used when printing a table-row.

Example 3-3. Defining an SQL-database table 

from sqlalchemy import Column, Integer, String, Enum
from sqlalchemy.ext.declarative import declarative_base

# The Base class described in the text above
Base = declarative_base()

class Winner(Base):
    __tablename__ = 'winners'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    category = Column(String)
    year = Column(Integer)
    nationality = Column(String)
    sex = Column(Enum('male', 'female'))

    def __repr__(self):
        return "<Winner(name='%s', category='%s', year='%s')>"\
            % (self.name, self.category, self.year)

4 See details here of this lazy initialization.

Having declared our Base subclasses in Example 3-3 we supply its
metadata create_all method with our database engine to create
our database^5. Because we set the echo argument to True when
creating the engine, we can see the SQL instructions generated by
SQLAlchemy from the command-line:

Base.metadata.create_all(engine)

INFO:sqlalchemy.engine.base.Engine:SELECT CAST('test plain
returns' AS VARCHAR(60)) AS anon_1
INFO:sqlalchemy.engine.base.Engine:
CREATE TABLE winners (
    id INTEGER NOT NULL,
    name VARCHAR,
    category VARCHAR,
    year INTEGER,
    nationality VARCHAR,
    sex VARCHAR(6),
    PRIMARY KEY (id),
    CHECK (sex IN ('male', 'female'))
)
INFO:sqlalchemy.engine.base.Engine:COMMIT
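To see what the echoed schema means in practice, the CREATE TABLE statement can be run directly against an in-memory SQLite database with Python's standard sqlite3 module. This sketch bypasses SQLAlchemy entirely and assumes Python 3. It shows that the Enum column became a CHECK constraint which SQLite enforces, and the rejected row is invented for the test.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# The schema as echoed by SQLAlchemy for the Winner class.
conn.execute("""
    CREATE TABLE winners (
        id INTEGER NOT NULL,
        name VARCHAR,
        category VARCHAR,
        year INTEGER,
        nationality VARCHAR,
        sex VARCHAR(6),
        PRIMARY KEY (id),
        CHECK (sex IN ('male', 'female'))
    )
""")

insert = ("INSERT INTO winners (name, category, year, nationality, sex) "
          "VALUES (?, ?, ?, ?, ?)")

# A valid row: the integer primary key is assigned automatically.
conn.execute(insert, ('Marie Curie', 'Chemistry', 1911, 'Polish', 'female'))

# A hypothetical invalid row: 'unknown' fails the CHECK constraint.
try:
    conn.execute(insert, ('Nobody', 'Physics', 1900, 'Unknown', 'unknown'))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id, name, sex FROM winners").fetchall()
print(rejected, rows)
```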

With our new winners table declared we can start adding winner 
instances to it. 


5 This assumes the database doesn’t already exist. If it does the Base will be used to create 
new insertions and to interpret retrievals. 
Adding instances with a session

Now that we have created our database, we need a session to interact
with:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()

We can now use our Winner class to create instances/database-rows
and add them to the session:

albert = Winner(**nobel_winners[0])  # <1>
session.add(albert)
session.new  # <2>
Out:
IdentitySet([<Winner(name='Albert Einstein', category='Physics',
year='1921')>])

<1> Python's handy ** operator unpacks our first nobel_winners
member into key-value pairs, i.e. (name='Albert Einstein',
category='Physics'...).

<2> new is the set of any items that have been added to this session.

Note that all database insertions, deletions etc. take place in Python.
It's only when we use the commit method that the database is altered.

Use as few commits as possible, allowing
SQLAlchemy to work its magic behind the
scenes. When you commit, your various database
manipulations should be summarised by
SQLAlchemy and communicated in an efficient
fashion. Commits involve establishing a database
handshake and negotiating transactions,
often a slow process and one you want to limit
as much as possible, leveraging SQLAlchemy's
book-keeping abilities to full advantage.


As the new attribute shows, we have added a Winner to the session.
We can remove the object using expunge, leaving an empty
IdentitySet:

session.expunge(albert)  # <1>
session.new
Out:
IdentitySet([])

<1> Removes the instance from the session (there is an expunge_all
method which removes all new objects added to the session).

At this point no database insertions or deletions have taken place.
Let's add all the members of our nobel_winners list to the session
and commit them to the database:

winner_rows = [Winner(**w) for w in nobel_winners]
session.add_all(winner_rows)
session.commit()
Out:
INFO:sqlalchemy.engine.base.Engine:BEGIN (implicit)
INFO:sqlalchemy.engine.base.Engine:INSERT INTO winners (name,
category, year, nationality, sex) VALUES (?, ?, ?, ?, ?)
INFO:sqlalchemy.engine.base.Engine:(u'Albert Einstein',
u'Physics', 1921, u'Swiss', u'male')
INFO:sqlalchemy.engine.base.Engine:COMMIT

Now that we've committed our nobel_winners data to the database,
let's see what we can do with it and how to recreate the target list
in Example 3-1.

Querying the database 

To access data you use the session's query method, the result of
which can be filtered, grouped, intersected etc., allowing the full
range of standard SQL data retrieval. You can check out querying
methods available here. For now I'll quickly run through some of
the most common queries on our Nobel data-set.

Let's first count the number of rows in our winners table:

session.query(Winner).count()
Out:
3

Next, let's retrieve all Swiss winners:

result = session.query(Winner).filter_by(nationality='Swiss')  # <1>
list(result)
Out:
[<Winner(name='Albert Einstein', category='Physics', year='1921')>]

<1> filter_by uses keyword-expressions, its SQL-expressions
counterpart being filter, e.g. filter(Winner.nationality ==
'Swiss').

Now let's get all non-Swiss Physics winners:

result = session.query(Winner).filter(
    Winner.category == 'Physics', Winner.nationality != 'Swiss')
list(result)
Out:
[<Winner(name='Paul Dirac', category='Physics', year='1933')>]

Here's how to get a row based on id-number:

session.query(Winner).get(3)
Out:
<Winner(name='Marie Curie', category='Chemistry', year='1911')>

Now let's retrieve winners ordered by year:

res = session.query(Winner).order_by('year')
list(res)
Out:
[<Winner(name='Marie Curie', category='Chemistry', year='1911')>,
 <Winner(name='Albert Einstein', category='Physics', year='1921')>,
 <Winner(name='Paul Dirac', category='Physics', year='1933')>]

To reconstruct our target list requires a little effort converting the
Winner objects returned by our session query into Python dicts.
Let's write a little function to create a dict from an SQLAlchemy
class. We'll use a little table-introspection to get the column labels
(see Example 3-4).


Example 3-4. Converts a SQLAlchemy instance to a dict 

def inst_to_dict(inst, delete_id=True):
    dat = {}
    for column in inst.__table__.columns:  # <1>
        dat[column.name] = getattr(inst, column.name)
    if delete_id:
        dat.pop('id')  # <2>
    return dat

<1> Accesses the instance's table class to get a list of column objects.
<2> If delete_id is true, remove the SQL primary id field.
We can use Example 3-4 to reconstruct our nobel_winners target
list:

winner_rows = session.query(Winner)
nobel_winners = [inst_to_dict(w) for w in winner_rows]
nobel_winners
Out:
[{'category': u'Physics',
  'name': u'Albert Einstein',
  'nationality': u'Swiss',
  'sex': u'male',
  'year': 1921},
]

You can update database rows easily by changing the property of
their reflected objects:

marie = session.query(Winner).get(3)  # <1>
marie.nationality = 'French'
session.dirty  # <2>
Out:
IdentitySet([<Winner(name='Marie Curie', category='Chemistry',
year='1911')>])

<1> Fetches Marie Curie, nationality Polish.

<2> dirty shows any changed instances not yet committed to the
database.

Let's commit marie's changes and check that her nationality has
changed from Polish to French:

session.commit()
Out:
INFO:sqlalchemy.engine.base.Engine:UPDATE winners SET
nationality=? WHERE winners.id = ?
INFO:sqlalchemy.engine.base.Engine:('French', 3)

session.dirty
Out:
IdentitySet([])

session.query(Winner).get(3).nationality
Out:
'French'

As well as updating database rows you can delete the results of a
query:

session.query(Winner).filter_by(name='Albert Einstein').delete()
Out:
INFO:sqlalchemy.engine.base.Engine:DELETE FROM winners WHERE
winners.name = ?
INFO:sqlalchemy.engine.base.Engine:('Albert Einstein',)
1

list(session.query(Winner))
Out:
[<Winner(name='Paul Dirac', category='Physics', year='1933')>,
 <Winner(name='Marie Curie', category='Chemistry', year='1911')>]

You can also drop the whole table if required, using the declarative
class's __table__ attribute:

Winner.__table__.drop(engine)

In this section we've dealt with a single winners table, without any
foreign-keys or relationship to any other tables, akin to a CSV or
JSON file. SQLAlchemy adds the same level of convenience to dealing
with many-to-one, one-to-many and other relationships between tables
as it does to basic querying, using implicit joins (by supplying the
query method with more than one table class) or explicit ones (using
the query's join method). Check out the examples here for more details.

Easier SQL with Dataset 

One library I've found myself using a fair deal recently is Dataset, a
module designed to make working with SQL databases a little easier
and more Pythonic than existing powerhouses like SQLAlchemy^6.
Dataset tries to provide the same degree of convenience you get
when working with schemaless NoSQL databases such as MongoDB,
removing a lot of the formal boilerplate, such as schema definitions,
demanded by the more conventional libraries. Dataset is
built on top of SQLAlchemy, which means it works with pretty much
all major databases and can exploit the power, robustness and
maturity of that best-of-breed library. Let's see how it deals with
reading and writing our target dataset (from Example 3-1).

Let's use the SQLite nobel_prize.db database we've just created to
put Dataset through its paces:


6 Dataset's official motto being 'databases for lazy people'.

First we connect to our SQL database, using the same URL/file format
as SQLAlchemy:

import dataset

db = dataset.connect('sqlite:///data/nobel_prize.db')

To get our list of winners we grab a table from our db database,
using its name as a key, and then use the find method without
arguments to return all winners:

wtable = db['winners']
winners = wtable.find()
winners = list(winners)
winners
Out:
[OrderedDict([(u'id', 1), (u'name', u'Albert Einstein'),
(u'category', u'Physics'), (u'year', 1921), (u'nationality',
u'Swiss'), (u'sex', u'male')]), OrderedDict([(u'id', 2),
(u'name', u'Paul Dirac'), (u'category', u'Physics'),
(u'year', 1933), (u'nationality', u'British'), (u'sex',
u'male')]), OrderedDict([(u'id', 3), (u'name', u'Marie
Curie'), (u'category', u'Chemistry'), (u'year', 1911),
(u'nationality', u'Polish'), (u'sex', u'female')])]

Note that the instances returned by Dataset's find method are
OrderedDicts. These useful containers are an extension of Python's dict
class which behave just like dictionaries but remember the order in
which items were inserted, meaning you can guarantee the result of
iteration, pop the last item inserted etc. This is a very handy
additional functionality.

One of the most useful Python batteries for data-mungers
is collections, from where Dataset's
OrderedDicts come. The defaultdict and
Counter classes are particularly useful. Check
out what's available here.
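As a quick taste of those three collections classes, here is a small sketch, assuming Python 3, applied to a hand-written copy of the three winners used throughout the chapter:

```python
from collections import Counter, OrderedDict, defaultdict

nobel_winners = [
    {'category': 'Physics', 'name': 'Albert Einstein',
     'nationality': 'Swiss', 'sex': 'male', 'year': 1921},
    {'category': 'Physics', 'name': 'Paul Dirac',
     'nationality': 'British', 'sex': 'male', 'year': 1933},
    {'category': 'Chemistry', 'name': 'Marie Curie',
     'nationality': 'Polish', 'sex': 'female', 'year': 1911},
]

# Counter tallies hashable items, here winners per category.
category_counts = Counter(w['category'] for w in nobel_winners)

# defaultdict creates a missing value on first access, which makes
# grouping a one-liner per item.
names_by_sex = defaultdict(list)
for w in nobel_winners:
    names_by_sex[w['sex']].append(w['name'])

# OrderedDict remembers insertion order; popitem() removes the
# most recently inserted pair.
row = OrderedDict([('id', 1), ('name', 'Albert Einstein'), ('year', 1921)])
last = row.popitem()

print(category_counts)
print(dict(names_by_sex))
print(last, list(row))
```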

Let's recreate our winners table with Dataset, first dropping the
existing one:

wtable = db['winners']
wtable.drop()

wtable = db['winners']
wtable.find()
Out:
[]


To recreate our dropped winners table, we don't need to define a
schema as with SQLAlchemy (see "Defining the database tables" on
page 88). Dataset will infer that from the data we add, doing all the
SQL creation implicitly. This is the kind of convenience one is used
to when working with collection-based NoSQL databases. Let's use
our nobel_winners dataset (Example 3-1) to insert some winner
dictionaries. We use a database transaction and the with statement
to efficiently insert our objects and then commit them^7.

with db as tx:  # <1>
    for w in nobel_winners:
        tx['winners'].insert(w)

<1> Use the with statement to guarantee the transaction tx is
committed to the database.

Let's check that everything has gone well:

list(db['winners'].find())
Out:
[OrderedDict([(u'id', 1), (u'name', u'Albert Einstein'),
(u'category', u'Physics'), (u'year', 1921), (u'nationality',
u'Swiss'), (u'sex', u'male')]),
]

The winners have been correctly inserted and their order of
insertion preserved by the OrderedDict.

Dataset is great for basic SQL-based work, particularly retrieving
data you might wish to process or visualise. For more advanced
manipulation it allows you to drop down into SQLAlchemy's core
API using the query method.

On top of its huge convenience, Dataset has a freeze method which
is a great asset to budding data-visualisers. freeze will take the
result of an SQL query and turn it into a JSON or CSV file, a very
convenient way to start playing around with the data with
JavaScript/D3:

winners = db['winners'].find()
dataset.freeze(winners, format='csv',
               filename='data/nobel_winners_ds.csv')
open('data/nobel_winners_ds.csv').read()
Out:
'id,name,category,year,nationality,sex\r\n
1,Albert Einstein,Physics,1921,Swiss,male\r\n
2,Paul Dirac,Physics,1933,British,male\r\n
3,Marie Curie,Chemistry,1911,Polish,female\r\n'

7 See here for further details of how to use transactions to group updates.

Now that we've covered the basics of working with SQL databases,
let's see how Python makes working with the most popular NoSQL
database just as painless.

MongoDB 

Document-centric data-stores like MongoDB offer a lot of convenience
to data wranglers. As with all tools, there are good and bad
use-cases for NoSQL databases but if you have data that has already
been refined and processed and don't anticipate needing SQL's powerful
query language, based on optimised table-joins etc., MongoDB
will probably prove easier to work with. MongoDB is a particularly
good fit for web-dataviz because it uses Binary JSON (BSON) as its
data-format. An extension of JSON, BSON can deal with binary data,
datetime objects etc. and plays very nicely with JavaScript.

Let's remind ourselves of the target data-set we're aiming to write
and read:

nobel_winners = [
    {'category': 'Physics',
     'name': 'Albert Einstein',
     'nationality': 'Swiss',
     'sex': 'male',
     'year': 1921},
]

Creating a MongoDB collection with Python is the work of a few
lines:

from pymongo import MongoClient

client = MongoClient()  # <1>
db = client.nobel_prize  # <2>
coll = db.winners  # <3>

<1> Creates a Mongo-client, using the default host and ports.

<2> Creates or accesses the nobel_prize database.

<3> If a winners collection exists this retrieves it, otherwise (as in
our case) it creates it.


Using Constants for MongoDB Access 

Accessing and creating a MongoDB database with Python involves
the same operation, using dot-notation or square-bracket key-access:

db = client.nobel_prize
db = client['nobel_prize']

This is all very convenient but it means a single spelling mistake,
e.g. noble_prize, could both create an unwanted database and cause
future operations to fail to update the correct one. For this reason
I would advise using constant strings to access your MongoDB databases
and collections:

DB_NOBEL_PRIZE = 'nobel_prize'
COLL_WINNERS = 'winners'

db = client[DB_NOBEL_PRIZE]
coll = db[COLL_WINNERS]


MongoDB databases run on localhost port 27017 by default but
could be anywhere on the web. They also take an optional username
and password. Example 3-5 shows how to create a simple utility
function to access our database, with standard defaults.


Example 3-5. Accessing a MongoDB database 

from pymongo import MongoClient

def get_mongo_database(db_name, host='localhost',
                       port=27017, username=None, password=None):
    """ Get named database from MongoDB with/out authentication """
    # make Mongo connection with/out authentication
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s/%s' % \
            (username, password, host, db_name)  # <1>
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db_name]

<1> We specify the database-name in the mongo URI (Uniform
Resource Identifier) as the user may not have general privileges
for the database.

We can now create a Nobel-prize database and add our target data-set
(Example 3-1). Let's first get a winners collection, using the string
constants for access:

db = get_mongo_database(DB_NOBEL_PRIZE)
coll = db[COLL_WINNERS]

Inserting our Nobel dataset is then as easy as can be:

coll.insert(nobel_winners)
Out:
[ObjectId('55f8326f26a7112e547879d4'),
 ObjectId('55f8326f26a7112e547879d5'),
 ObjectId('55f8326f26a7112e547879d6')]

The resulting array of ObjectIds can be used for future retrieval but
MongoDB has already left its stamp on our nobel_winners list,
adding a hidden _id property^8:

nobel_winners
Out:
[{'_id': ObjectId('55f8326f26a7112e547879d4'),
  'category': u'Physics',
  'name': u'Albert Einstein',
  'nationality': u'Swiss',
  'sex': u'male',
  'year': 1921},
]

8 One of the cool things about MongoDB is that the ObjectIds are generated client-side,
removing the need to quiz the database for them.
MongoDB's ObjectIds have quite a bit of hidden
functionality, being a lot more than a simple
random identifier. You can, for example, get the
generation time of the ObjectId, giving you
access to a handy time-stamp:

import bson

oid = bson.ObjectId()
oid.generation_time
Out: datetime.datetime(2015, 11, 4, 15, 43, 23...

Find the full details here.
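The time-stamp is recoverable because the first four bytes of an ObjectId hold its creation time as seconds since the Unix epoch. The following sketch decodes it with the standard library alone, assuming Python 3 and an id string typed in by hand from the insert output shown earlier. With pymongo installed, generation_time is the more natural route.

```python
from datetime import datetime, timezone

# An ObjectId in its 24-character hex form, copied by hand.
oid_hex = '55f8326f26a7112e547879d4'

# The first 8 hex characters (4 bytes) are a big-endian Unix timestamp.
seconds = int(oid_hex[:8], 16)
generated = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(generated.isoformat())
```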

Now that we've got some items in our winners collection, MongoDB
makes finding them very easy, with its find method taking a
dictionary query:

res = coll.find({'category': 'Chemistry'})
list(res)
Out:
[{u'_id': ObjectId('55f8326f26a7112e547879d6'),
  u'category': u'Chemistry',
  u'name': u'Marie Curie',
  u'nationality': u'Polish',
  u'sex': u'female',
  u'year': 1911}]

There are a number of special dollar-prefixed operators which allow
for sophisticated querying. Let's find all the winners after 1930 using
the $gt (greater-than) operator:

res = coll.find({'year': {'$gt': 1930}})
list(res)
Out:
[{u'_id': ObjectId('55f8326f26a7112e547879d5'),
  u'category': u'Physics',
  u'name': u'Paul Dirac',
  u'nationality': u'British',
  u'sex': u'male',
  u'year': 1933}]

You can also use Boolean expressions, e.g. to find all winners after
1930 or female:

res = coll.find({'$or': [{'year': {'$gt': 1930}}, {'sex': 'female'}]})
list(res)
Out:
[{u'_id': ObjectId('55f8326f26a7112e547879d5'),
  u'category': u'Physics',
  u'name': u'Paul Dirac',
  u'nationality': u'British',
  u'sex': u'male',
  u'year': 1933},
 {u'_id': ObjectId('55f8326f26a7112e547879d6'),
  u'category': u'Chemistry',
  u'name': u'Marie Curie',
  u'nationality': u'Polish',
  u'sex': u'female',
  u'year': 1911}]

You can find the full list of available query expressions here. 
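To make the meaning of these query dictionaries concrete without a running MongoDB server, here is a toy matcher in plain Python 3. The matches function is a hypothetical helper written for this illustration. It covers only equality, $gt and $or, and is not how pymongo or MongoDB evaluate queries.

```python
def matches(doc, query):
    """Return True if doc satisfies a tiny subset of MongoDB-style queries."""
    for key, cond in query.items():
        if key == '$or':
            # At least one of the sub-queries must match.
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # Operator expression, e.g. {'$gt': 1930}
            for op, val in cond.items():
                if op != '$gt':
                    raise ValueError('unsupported operator: ' + op)
                if not (key in doc and doc[key] > val):
                    return False
        elif doc.get(key) != cond:
            # Plain equality test
            return False
    return True


winners = [
    {'name': 'Albert Einstein', 'category': 'Physics', 'sex': 'male', 'year': 1921},
    {'name': 'Paul Dirac', 'category': 'Physics', 'sex': 'male', 'year': 1933},
    {'name': 'Marie Curie', 'category': 'Chemistry', 'sex': 'female', 'year': 1911},
]

query = {'$or': [{'year': {'$gt': 1930}}, {'sex': 'female'}]}
found = [w['name'] for w in winners if matches(w, query)]
print(found)
```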

As a final test, let's turn our new winners collection back into a
Python list of dictionaries. We'll create a little utility function for the
task:

def mongo_coll_to_dicts(dbname='test', collname='test',
                        query={}, del_id=True, **kw):  # <1>
    db = get_mongo_database(dbname, **kw)
    res = list(db[collname].find(query))

    if del_id:
        for r in res:
            r.pop('_id')

    return res

<1> An empty query dict {} will find all documents in the collection.
del_id is a flag to remove MongoDB's ObjectIds from the
items by default.

We can now create our target dataset:

mongo_coll_to_dicts(DB_NOBEL_PRIZE, COLL_WINNERS)
Out:
[{u'category': u'Physics',
  u'name': u'Albert Einstein',
  u'nationality': u'Swiss',
  u'sex': u'male',
  u'year': 1921},
]

MongoDB's schemaless databases are great for fast prototyping in
solo work or small teams. There will probably come a point,
particularly with large code-bases, where some formal schema is a useful
reference and sanity-check but while settling on the right data-model,
the ease with which document forms can be adapted is a
bonus.

Being able to pass Python dictionaries as queries to pymongo and
having access to client-side generated ObjectIds are a couple of other
conveniences.

WeVe now passed the nobel_winners data in Example 3-1 through 
all our required file-formats and databases. Lets considar the special 
case of dealing with dates and times before summing up. 

Dealing with Dates, Times and Complex Data 

The ability to deal comfortably with dates and times is fundamental
to data-viz work but can be quite tricky. There are many ways to
represent a date or date-time as a string, each one requiring a separate
encoding or decoding. For this reason it’s good to settle on one
format in your own work and encourage others to do the same. I’d
recommend using the International Organization for Standardization
(ISO) 8601 time-format as your string representation for dates and
times and using the Coordinated Universal Time (UTC) form [9]. Here
are a few examples of ISO 8601 date and date-time strings:

2015-09-23               A date (Python/C format-code %Y-%m-%d)

2015-09-23T16:32:35Z     A UTC (Z after time) date and time (T%H:%M:%S)

2015-09-23T16:32+02:00   A positive two-hour (+02:00) offset from UTC, e.g. Central
                         European Summer Time


Note the importance of being prepared to deal with different time- 
zones. These are not always on lines of longitude (see here) and 
often the best way to derive an accurate time is by UTC-time plus a 
geographic location. 

ISO 8601 is the standard used by JavaScript and is easy to work with
in Python. As web data-visualisers our key concern is creating a
string representation that can be passed between Python and
JavaScript using JSON and encoded and decoded easily at both ends.
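
As a rough sketch of that round trip on the Python side, the snippet below encodes a UTC datetime as an ISO 8601 string inside a JSON payload and decodes it again. It assumes Python 3.7 or later (for datetime.fromisoformat) rather than the Python 2 used in this chapter’s listings, and the event and time keys are made-up names for illustration.

```python
import json
from datetime import datetime, timezone

# A timezone-aware UTC datetime (the values are arbitrary examples)
d = datetime(2015, 9, 23, 16, 32, 35, tzinfo=timezone.utc)

# Encode: isoformat() produces an ISO 8601 string that JSON can carry
payload = json.dumps({'event': 'example', 'time': d.isoformat()})

# Decode: parse the string back into a datetime at the receiving end
record = json.loads(payload)
d2 = datetime.fromisoformat(record['time'])

print(record['time'])  # 2015-09-23T16:32:35+00:00
print(d2 == d)         # True
```

Keeping the offset in the string means the receiving end never has to guess which time-zone was intended.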


9 To get the actual local time from UTC you can store a time-zone offset or, better still,
derive it from a geo-coordinate; this is because time-zones do not follow lines of
longitude very exactly.

Let’s take a date and time, in the shape of a Python datetime, convert
it into a string, and then see how that string can be consumed
by JavaScript.

First we produce our Python datetime:

from datetime import datetime

d = datetime.now()

d.isoformat()

Out:

'2015-09-15T21:48:50.746674'

This string can then be saved to JSON, CSV etc., read by JavaScript
and used to create a Date object:

d = new Date('2015-09-15T21:48:50.746674')

> Tue Sep 15 2015 22:48:50 GMT+0100 (BST)

We can return the date-time to ISO 8601 string form with the
toISOString method:

d.toISOString()

> "2015-09-15T21:48:50.746Z"

Finally, we can read the string back into Python. 

If you know that you’re dealing with an ISO-format time-string,
Python’s dateutil module should do the job [10]. But you’ll probably
want to sanity-check the result:

from dateutil import parser

d = parser.parse("2015-09-15T21:48:50.746Z")
d

Out:

datetime.datetime(2015, 9, 15, 21, 48, 50, 746000, tzinfo=tzutc())

Note that we’ve lost some resolution in the trip from Python to
JavaScript and back again, the latter dealing in milli- not microseconds.
This is unlikely to be an issue in any dataviz work but is good to bear
in mind, just in case some strange temporal errors occur.
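
If you would rather not depend on dateutil, the standard library can parse the exact string shape that JavaScript’s toISOString produces. The following is a minimal sketch under that assumption (a fixed format with a fractional-seconds part and a trailing Z); it is written for Python 3 and is stricter than dateutil, which copes with many more variants.

```python
from datetime import datetime, timezone

js_string = '2015-09-15T21:48:50.746Z'  # as produced by toISOString()

# %f accepts one to six fractional digits; the literal Z marks UTC
d = datetime.strptime(js_string, '%Y-%m-%dT%H:%M:%S.%fZ')
d = d.replace(tzinfo=timezone.utc)

print(d.microsecond)  # 746000
print(d.isoformat())  # 2015-09-15T21:48:50.746000+00:00
```

The microsecond value shows the lost resolution directly: the original 746674 has come back as 746000.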


10 To install just run ‘pip install python-dateutil’. dateutil is a pretty powerful extension of
Python’s datetime, check it out here.

Summary

This chapter aimed to make you comfortable using Python to move
data around the various file-formats and databases a data-visualiser
might expect to bump into. Using databases effectively and efficiently
is a skill it takes a while to learn but you should now be comfortable
with basic reading and writing for the large majority of
dataviz use-cases.

Now we have the vital lubrication for our dataviz tool-chain, let’s get
up to scratch on the basic web-dev skills you’ll need for the chapters
ahead.

CHAPTER 4


Webdev 101


This chapter introduces the core web-development knowledge you 
will need to understand the web-pages you might want to scrape for 
data and to structure those you want to deliver, as the skeleton of 
your JavaScripted visualisations. As you’ll see, in modern web-dev a
little knowledge goes a long way, particularly when your focus is 
building self-contained visualisations and not whole web-sites (See 
“Single-page Apps” on page 106 for more details). 

The usual caveats apply to a chapter in Part I — this chapter is part 
reference, part tutorial. There will probably be stuff here you know 
already so feel free to skip over it and get to the new material. 

The Big Picture 

The humble web-page, the collection of which comprises the
World Wide Web (WWW), that fraction of the internet consumed
by humans, is constructed from files of various types. Apart from
the multi-media files (images, videos, sound etc.), the key elements
are textual, consisting of Hypertext Markup Language (HTML),
Cascading Style Sheets (CSS), and JavaScript. These three, along with
any necessary data-files, are delivered using the Hypertext Transfer
Protocol (HTTP) and used to build the page you see and interact
with in your browser window, which is described by the Document
Object Model (DOM), a hierarchical tree off which your content
hangs. A basic understanding of how these elements interact is vital
to building modern web visualisations and the aim of this chapter is
to get you quickly up to speed.

Web-development is a big field and the aim here is not to turn you
into a full-fledged web-developer. I assume you want to limit the
amount of basic web-dev as much as possible, focusing only on that
fraction necessary to build a modern visualisation. In order to build
the sort of visualisations showcased at d3js.org, the New York Times
or incorporated in basic interactive data-dashboards you actually
need surprisingly little webdev fu. The result of your labours should
be easily added to a larger web-site by someone dedicated to that
job. In the case of small, personal web-sites it’s trivial to do it
yourself.

Single-page Apps 

Single-page applications (SPAs) are web applications (or whole sites)
which are dynamically assembled using JavaScript, often building
from a lightweight HTML backbone and CSS-styles that can be
applied dynamically with class and id attributes. Many modern
data-visualisations fit this description, including the Nobel-prize
visualisation that this book builds towards.

Often self-contained, the SPA’s root-folder can be easily incorporated
in an existing web-site or stand alone, requiring only an
HTTP server such as Apache or Nginx.

Thinking of our data-visualisations in terms of SPAs removes a lot
of the cognitive overhead from the web-dev aspect of JavaScript
visualisations, allowing us to focus on the programming challenges.
The skills required to put the visualisation on the web are still fairly
basic and quickly amortised. Often it will be someone else’s job.

Tooling Up

As you’ll see, the web-dev needed to make modern data-visualisations
requires no more than a decent text-editor, modern
browser and a terminal (Figure 4-1). I’ll cover what I see as the
minimal requirements for a web-dev ready editor and non-essential but
nice-to-have features. My browser development tools of choice are
Chrome’s web-developer kit, freely available on all platforms. It has a
lot of tab-delineated functionality and in this chapter I’ll cover:

• The Elements tab, which allows you to explore the structure of a
web-page, its HTML content, CSS styles and DOM presentation.

• The Sources tab, where most of your JavaScript debugging will
take place.

You’ll need a terminal for output, starting your local web-server,
sketching ideas with the IPython interpreter etc.


Editor: multi-language aware, syntax highlighting, code-linting, comfortable.

Browser: modern, good JS-engine, powerful debugger, SVG compliant, good WebGL a bonus.

Console: server logging, output/logging from Python modules.

Figure 4-1. Primary Webdev Tools




Before dealing with what you do need, let’s deal with a few things
you really don’t need when setting out, laying a couple of myths to
rest on the way.

The myth of IDEs, frameworks and tools

There is a common assumption among prospective JavaScripters
that to program for the web requires a complex toolset, primarily an
Integrated Development Environment (IDE), as used by enterprise
(and other) coders everywhere. This is both potentially expensive
and presents another learning curve. The good news is that not only
have I never used an IDE to program for the web but I can’t think of
anyone I know in the discipline who does. In all probability the
wonderful web-visualisations you have seen, that may have spurred
you to pick up this book, were created with nothing more than a
humble text-editor, a modern web-browser for viewing and debugging,
and a console or terminal for logging and output.

There is also a commonly believed myth that one cannot be productive
in JavaScript without using a framework of some kind. At the
moment a number of these frameworks are vying for control of the
JS ecosystem, sponsored by the various huge companies that created
them. These frameworks come and go at a dizzying rate and my
advice for anyone starting out in JavaScript is to ignore them
entirely while you develop your core skills. Use small, targeted
libraries, such as those in the jQuery ecosystem or Underscore’s
functional programming extensions and see how far you can get before
needing a my-way-or-the-highway framework. Only lock yourself into
a framework to meet a clear and present need, not because the current
JS group-think is raving about how great it is [1]. Another important
consideration is that D3, the prime web dataviz library, doesn’t
really play well with any of the bigger frameworks I know,
particularly the ones that want control over the DOM.

Another thing you’ll find if you hang around web-dev forums,
Reddit-lists, Stackoverflow etc. is a huge range of tools constantly
clamouring for attention. There are JS+CSS minifiers, watchers to
automatically detect file-changes and reload web-pages during
development, and so on. While a few of these have their place, in my
experience there are a lot of flaky tools which probably cost more
time in hair-tearing than they gain in productivity. To reiterate, you
can be very productive without such stuff and should only reach for
one to scratch a current itch. Some, like Bower covered in this
chapter, are keepers but very few are remotely essential for
data-visualisation work.

1 I bear the scars so you don’t have to.

Your text editing work-horse 

First and foremost among your webdev tools is a text-editor you are 
comfortable with and which can, at the very least, do syntax high- 
lighting for multiple languages -in our case HTML, CSS, JavaScript 
and Python. You can get away with a plain, non-highlighting editor 
but in the long run it will prove a pain. Things like syntax- 
highlighting, code-linting, intelligent indentation and the like 
remove a huge cognitive load from the process of programming, so 
much so that I see their absence as a limiting factor. These are my 
minimal requirements for a text editor: 

• Syntax highlighting for all languages you use. 

• Configurable indentation levels and types for languages (e.g. 
Python 4 soft-tabs, JavaScript 2 soft-tabs). 

• Multiple windows/panes/tabs to allow easy navigation around 
your code-base. 

If you are using a relatively advanced text-editor, all the above
should come as standard with the exception of code-linting which
will probably require a bit of configuration.

My leading candidate for nice-to-have is a decent code-linter. If the
mark of a useful tool is how much you would miss it in its absence then
code-linting is easily in my top five. For scripting languages like
Python and JavaScript, there’s only so much intelligent code-analysis
that can be achieved syntactically but just sanity-checking the
obvious syntax errors can be a huge time-saver. In JavaScript in
particular, some mistakes are transparent, in the sense that things will
run in spite of them, and will quite often produce confusing
error-messages. A code-linter can save you time here and enforce good
practice. Figure 4-2 shows a contrived example of a JavaScript
code-linter in action.

A recent addition to ECMAScript 5 [2] is a strict mode, which enforces a
modern JavaScript context. This mode is recognized by most linters
and can be invoked by placing 'use strict' at the top of your program
or within a function, to restrict it to that context. Modern browsers
should also honour the strict mode, throwing errors for
non-compliance. In strict mode trying to assign foo = "bar"; will fail if
foo hasn’t been previously defined. See John Resig’s nice
explanation here.

2 The specification for modern JavaScript is defined by the ECMAScript lineage.


  // you should use the function form of 'use strict'
! 'use strict';

  // you included jQuery, but never used it
! (function ($) {

      // foo not defined
!     foo = 'baa';

      // pub is defined but never used
!     var pub = {

          // this is part of an object literal, not an assignment
!         init = function(response) {

              // respnse should be response
!             console.log(respnse);

          }

      }

  // you're missing a semicolon here
  // also - it's jQuery, not jquery
! }(jquery))


Figure 4-2. A running code-linter analyses the JavaScript continuously,
highlighting syntax errors etc. in red and adding a ‘!’ to the left of the
offending line.

Browser with development tools 

One of the reasons an IDE is pretty much redundant in modern
web-dev is that the best place to do debugging is in the web-browser
itself and such is the pace of change there that any IDE attempting
to emulate that context would have its work cut out. On top of this,
modern web-browsers have evolved a powerful set of debugging and
development tools. Firefox’s Firebug led the way but has since been
surpassed by Chrome Developer, which offers a huge amount of
functionality, from sophisticated (certainly to a Pythonista) debugging
(parametric breakpoints, variable watches etc.) to memory and
processor optimization profiling, device emulation (want to know what
your web-page looks like on a smart-phone or tablet?) and a whole
lot more. Chrome Developer is my debugger of choice and will be
used in this book. Like everything covered, it’s free as in beer.

Terminal or command-prompt

The terminal or command-line will be where you initiate the various
servers and probably output the useful logging information. It’s
also where you’ll try out Python modules etc. or run a Python
interpreter (IPython being by some way the best).

In OSX and Linux, this window is called a Terminal or xterm. In 
Windows it’s a command-prompt which should be available through 
clicking Start->All-Programs->Accessories. 

Building a Web-page

There are four elements to a typical web-visualisation: 

• An HTML skeleton, with placeholders for our programmatic 
visualisation. 

• Cascading Style Sheets (CSS) which define the look and feel 
(e.g. border widths, colors, font-sizes, placement of content- 
blocks). 

• JavaScript to build the visualisation. 

• Data to be transformed. 

The first three of these are just text-files, created using our favourite
editor and delivered to the browser by our web-server (of which
more later - see ???). Let’s examine them in turn.

Serving Pages with HTTP 

The delivery of the HTML, CSS and JS files that are used to make a
particular web-page (and any data-files, multimedia video, audio) is
negotiated between a server and browser using the Hypertext Transfer
Protocol. HTTP provides a number of methods, the most commonly
used being GET, which requests a web resource, retrieving
data from the server if all goes well or throwing an error if it doesn’t.
We’ll be using GET, along with Python’s requests module, to scrape
some web-page content in Chapter 6.

To negotiate the browser-generated HTTP requests you’ll
need a server. In development you can run a little server locally
using Python’s command-line initialised SimpleHTTPServer, like
this:

$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

This server is now serving content locally on port 8000. You can
access the site it’s serving by going to the URL
http://localhost:8000 in your browser.

SimpleHTTPServer is a nice thing to have and OK for demos and the
like but it lacks a lot of basic functionality. For this reason, as we’ll
see in ???, it’s better to master the use of a proper development (and
production) server like Flask (this book’s server of choice).
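
In Python 3 the SimpleHTTPServer module was folded into http.server (so the command-line form becomes python -m http.server). The sketch below, which assumes Python 3.7 or later, starts such a server from code and then plays the part of the browser by issuing a single GET request. The temporary folder, the page content and the choice of a random free port are all illustrative assumptions.

```python
import functools
import http.client
import tempfile
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path

# Create a throwaway site folder containing a single page
site_dir = tempfile.mkdtemp()
Path(site_dir, 'index.html').write_text('<h2>Hello dataviz</h2>')

# Serve that folder; port 0 asks the OS for any free port
handler = functools.partial(SimpleHTTPRequestHandler, directory=site_dir)
server = HTTPServer(('127.0.0.1', 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Issue a GET request, as a browser would, and read the response
conn = http.client.HTTPConnection('127.0.0.1', port)
conn.request('GET', '/index.html')
resp = conn.getresponse()
status = resp.status
body = resp.read().decode('utf-8')
conn.close()

# Stop the server loop and release the socket
server.shutdown()
server.server_close()

print(status)  # 200
print(body)    # <h2>Hello dataviz</h2>
```

A status of 200 is HTTP’s way of saying all went well; asking for a file that isn’t there would come back as a 404 instead.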

The DOM

The HTML files you send through HTTP are converted at the
browser end into a Document Object Model or DOM, which can in
turn be adapted by JavaScript, this programmatic DOM being the
basis of dataviz libraries like D3. The DOM is a tree structure,
represented by hierarchical nodes, the top node being the main web-page
or Document.

Essentially, the HTML you write or generate with a template is
converted by the browser into a tree hierarchy of nodes, each one
representing an HTML element. The top node is called the “Document
Object” and all other nodes descend in a parent-child fashion.
Programmatically manipulating the DOM is at the heart of such
libraries as jQuery and the mighty D3 so it’s vital to have a good mental
model of what’s going on. A great way to get the feel for the DOM is
to use a web tool such as Chrome Developer (my recommended
toolset) to inspect branches of the tree. ??? shows the DOM tree of an
HTML page, accessible through the Elements tab.

Whatever you see rendered on the web-page, the book-keeping of
the objects’ state (displayed or hidden, matrix transform etc.) is
being done with the DOM. D3’s powerful innovation was to attach
data directly to the DOM and use it to drive visual changes (Data
Driven Documents).
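
To get a feel for that parent-child hierarchy without opening a browser, here is a small sketch using the html.parser module from Python 3’s standard library. It does not build a real DOM (the browser does far more), it just prints one indented line per element; the TreePrinter class and the sample page are hypothetical illustrations.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot have children
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class TreePrinter(HTMLParser):
    """Collect one indented line per element, mimicking a DOM tree view."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1  # children of this element sit one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1  # back up to the parent's level


page = """
<html>
  <body>
    <div id="chart-holder">
      <h2>A Dummy Chart</h2>
      <p>Some text</p>
    </div>
  </body>
</html>
"""

printer = TreePrinter()
printer.feed(page)
print('\n'.join(['document'] + ['  ' + line for line in printer.lines]))
```

The h2 and p elements print at the same indentation because they are siblings, both children of the div, which is exactly the relationship the Elements tab shows with its collapsible branches.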

The HTML skeleton 

A typical web visualisation uses an HTML skeleton, on which to
build the visualisation with JavaScript.

HTML is the language used to describe the content of a web-page. It
was first proposed by physicist Tim Berners-Lee in 1980 while he
was working at the CERN particle accelerator complex in Switzerland.
It uses tags such as <div>, <img> and <h*> to structure the
content of the page while CSS is used to define the look and feel [2]. The
advent of HTML5 has reduced the boilerplate considerably but the
essence is unchanged over those thirty years.

Fully specced HTML used to involve a lot of rather confusing header
tags but with HTML5 some thought was put into a more user-friendly
minimalism. This is pretty much the minimal requirement
for a starting template [3]:

<!DOCTYPE html>
<meta charset="utf-8">
<body>
  <!-- page content -->
</body>

So we need only declare the document HTML, our character-set
eight-bit Unicode and a body tag below which to add our page’s
content. This is a big advance on the book-keeping required before and
a very low threshold to entry, as far as creating the documents which
will be turned into web-pages goes. Note the comment tag form:
<!-- comment -->

Now, more realistically we would probably want to add some CSS
and JavaScript. You can add both directly to an HTML document by
using <style> and <script> tags like this:

<!DOCTYPE html>
<meta charset="utf-8">
<style>
  /* CSS */
</style>
<body>
  <!-- page content -->
  <script>
    // JavaScript
  </script>
</body>

This single-page HTML form is often used in examples such as
those visualisations at d3js.org. It’s convenient to have a single page
to deal with when demonstrating code or keeping track of files but
generally I’d suggest separating the HTML, CSS and JavaScript
elements into separate files. The big win here, apart from easier
navigation as the code-base gets larger, is that you can take full advantage
of your editor’s specific language enhancements such as solid syntax
highlighting, code-linting (essentially syntax-checking on the fly)
etc. While some editors and libraries claim to deal with embedded
CSS and JavaScript I haven’t found an adequate one.

2 You can code style in HTML tags, using the style attribute, but it’s generally bad
practice. Better to use classes and ids defined in CSS.

3 As demonstrated by Mike Bostock here http://bost.ocks.org/mike/d3/workshop/#8, with a
hat-tip to Paul Irish.

To use CSS and JavaScript files we just include them in the HTML
using <link> and <script> tags like this:

<!DOCTYPE html>
<meta charset="utf-8">
<link rel="stylesheet" href="style.css" />
<body>
  <!-- page content -->
  <script type="text/javascript" src="script.js" async></script> ❶
</body>

❶ Note the async directive to allow the browser to continue parsing
the page while the script loads [2].

2 See this Stackoverflow thread for a good explanation of where to put script tags:
http://stackoverflow.com/questions/436411/where-is-the-best-place-to-put-script-tags-in-html-markup.

Marking-up content

Visualisations often use a small subset of the available HTML tags,
usually building the page programmatically by attaching elements to
the DOM-tree.

The most common tag is the <div>, marking a block of content.
<div>s can contain other <div>s, allowing for a tree hierarchy, the
branches of which are used during element selection and to propagate
user-interface (UI) events such as mouse-clicks. Here’s a simple
div hierarchy:

<div id="my-chart-wrapper" class="chart-holder dev">
  <div id="my-chart" class="bar-chart">
    this is a placeholder, with parent #my-chart-wrapper
  </div>
</div>

Note the use of id and class attributes. These are used when selecting
DOM elements and to apply CSS styles. ids are unique identifiers:
each element should have only one and there should only be one
occurrence of any particular id per page. The class can be applied to
multiple elements, allowing bulk-selection, and each element can
have multiple classes.
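
Browsers won’t complain if an id accidentally appears twice, so a quick check can be handy. The following is a minimal Python 3 sketch, again leaning on the standard library’s html.parser, that counts ids and classes in a fragment and flags any repeated id. The IdClassCollector class and the deliberately faulty fragment are hypothetical examples.

```python
from collections import Counter
from html.parser import HTMLParser


class IdClassCollector(HTMLParser):
    """Record every id and class attribute found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'id' and value:
                self.ids[value] += 1
            elif name == 'class' and value:
                # one element may carry several space-separated classes
                self.classes.update(value.split())


fragment = """
<div id="my-chart-wrapper" class="chart-holder dev">
  <div id="my-chart" class="bar-chart">placeholder</div>
  <div id="my-chart" class="bar-chart dev">oops, same id again</div>
</div>
"""

collector = IdClassCollector()
collector.feed(fragment)

duplicate_ids = [i for i, n in collector.ids.items() if n > 1]
print(duplicate_ids)             # ['my-chart']
print(collector.classes['dev'])  # 2
```

Repeated classes are perfectly fine, which is why only the ids are checked for duplicates.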

For textual content, the main tags are <p>, <h*> and <br>. You’ll be
using these a lot. This code produces Figure 4-3:

<h2>A Level-2 Header</h2>
<p>A paragraph of body-text with a line-break here..<br>
and a second paragraph...</p>

  A Level-2 Header

  A paragraph of body-text with a line-break here..
  and a second paragraph...

Figure 4-3. An h2 header and text

Header tags are reverse ordered by size from the largest <h1>.

<div>, <h*>, <p> are what is known as block elements. They normally
begin and end with a new line. The other class of tag is inline
elements, which display without line-breaks. Images <img>, hyperlinks
<a>, and table cells <td> are among these, which include the
<span> tag for inline text:

<div id="inline-examples">
  <img src="path/to/image.png" id="prettypic"> ❶
  <p>This is a <a href="link-url">link</a> to
  <span class="url">link-url</span></p> ❷
</div>

❶ Note that we don’t need a closing tag for images.

❷ The span and link are continuous in the text.

Other useful tags include lists, ordered <ol> and unordered <ul>:

<ol>
  <li>First item</li>
  <li>Second item</li>
</ol>

HTML also has a dedicated <table> tag, useful if you want to
present raw data in your visualisation. This HTML produces the
header and row in Figure 4-4:

<table id='chart-data'>
  <tr> ❶
    <th>Name</th>
    <th>Category</th>
    <th>Country</th>
  </tr>
  <tr> ❷
    <td>Albert Einstein</td>
    <td>Physics</td>
    <td>Switzerland</td>
  </tr>
</table>

❶ The header row
❷ The first row of data

  Name              Category   Country
  Albert Einstein   Physics    Switzerland

Figure 4-4. An HTML table
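
Since our data usually starts life as a list of Python dictionaries, a table like this is easy to generate rather than type by hand. The sketch below is one way to do it in Python 3; the dicts_to_html_table function, its parameters and the sample row are hypothetical, and a real project would more likely reach for a templating library.

```python
from html import escape


def dicts_to_html_table(rows, columns, table_id='chart-data'):
    """Build an HTML <table> string: one header row, one row per dict."""
    lines = ['<table id="%s">' % escape(table_id)]
    header = ''.join('<th>%s</th>' % escape(col.capitalize())
                     for col in columns)
    lines.append('  <tr>%s</tr>' % header)
    for row in rows:
        # escape() guards against stray <, > or & in the data
        cells = ''.join('<td>%s</td>' % escape(str(row[col]))
                        for col in columns)
        lines.append('  <tr>%s</tr>' % cells)
    lines.append('</table>')
    return '\n'.join(lines)


winners = [{'name': 'Albert Einstein', 'category': 'Physics',
            'country': 'Switzerland'}]
table_html = dicts_to_html_table(winners, ['name', 'category', 'country'])
print(table_html)
```

Passing the column list explicitly fixes the column order, which a plain dictionary would not have guaranteed in older versions of Python.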

When making web visualisations, the most often used of the tags
above are the textual tags, to provide instructions, information
boxes etc. But the meat of our JavaScripted efforts will probably be
devoted to building DOM branches rooted on the Scalable Vector
Graphics (SVG) <svg> and <canvas> tags. On most modern browsers
the <canvas> tag also supports a 3D WebGL context, allowing
OpenGL visualisations to be embedded in the page.

We’ll deal with SVG, the focus of this book and the format used by
the mighty D3 library, in a later section (“Scalable Vector Graphics
(SVG)” on page 127). Now let’s look at how we add style to our
content blocks.

CSS

CSS, short for Cascading Style Sheets, is a language for describing
the look and feel of a web-page. While you can hard-code style
attributes into your HTML it’s generally considered bad practice [2].
Much better to label your tag with an id or class and use that to
apply styles in the stylesheet.

2 This is not the same as programmatically setting styles, which is a hugely powerful
technique allowing styles to adapt to user interaction etc.

The key word in CSS is cascading - CSS follows a precedence rule so 
that in the case of a clash, the latest style overrides earlier ones. This 
means the order of inclusion for sheets is important. Usually you 
want your stylesheet to be loaded last so that you can override both 
the browser defaults and styles defined by any libraries you are 
using. 

Figure 4-5 shows how CSS is used to apply styles to the HTML
elements. First the element is selected, using hash #s to indicate a
unique ID and dot .s to select members of a class. You then define
one or more property->value pairs. Note that the font-family
property can be a list of fallbacks, in order of preference. Here we want
the browser default font-family of serif (capped strokes) to be
replaced with the more modern sans-serif, with Helvetica Neue our
first choice.


selector    property

#my-viz { font-family: 'Helvetica Neue',
          Helvetica, Arial, sans-serif; }
#lead { font-size: 150%; }
.alert { color: red; background: yellow }

<div id="my-viz">
  <div id="lead">
    <h2>A Leader header</h2>
    <p>Some enlarged text for
      <span class="alert">emphasis</span>.</p>
  </div>
  <p>and some normal sized text
    with our chosen font</p>
  <div id="chart-holder">
    <svg></svg>
  </div>
</div>

Figure 4-5. Styling the page with CSS

Understanding the CSS precedence rules is key to successfully
applying styles. In a nutshell the order is:

1. !important after a CSS property trumps all.

2. The more specific the better, i.e. ids override classes.

3. The order of declaration: last declaration wins, subject to 1.
and 2.

So, for example, say we have a <span> of class alert:

<span class="alert" id="special-alert">something to be alerted to</span>

Putting the following in our style.css file will make the alert text red
and bold:

.alert { font-weight: bold; color: red }

If we then add this to style.css, the id color black will override
the class color red, while the class font-weight remains bold:

#special-alert { background: yellow; color: black }

To enforce the color red for alerts, we can use the !important
directive [2]:

.alert { font-weight: bold; color: red !important }

If we then add another stylesheet, style2.css, after style.css:

<link rel="stylesheet" href="style.css" type="text/css" />
<link rel="stylesheet" href="style2.css" type="text/css" />

With style2.css containing the following:

.alert { font-weight: normal }

Then the font-weight of the alert will be reverted to normal, the new
class style having been declared last.
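
The three precedence rules can be modelled as a simple ranking. The Python sketch below replays the declarations from this example and picks a winner for each property by comparing (important, specificity, order) tuples. It is a deliberately simplified model: specificity is reduced to "id beats class", every rule is assumed to match the same span, and the resolve and specificity functions are hypothetical helpers, not how a browser is actually implemented.

```python
# Each declaration: (selector, property, value, important, source order).
declarations = [
    ('.alert', 'font-weight', 'bold', False, 0),           # style.css
    ('.alert', 'color', 'red', True, 1),                   # style.css, !important
    ('#special-alert', 'background', 'yellow', False, 2),  # style.css
    ('#special-alert', 'color', 'black', False, 3),        # style.css
    ('.alert', 'font-weight', 'normal', False, 4),         # style2.css
]


def specificity(selector):
    """Simplified specificity: an id selector outranks a class selector."""
    return 2 if selector.startswith('#') else 1


def resolve(declarations):
    """Pick the winning value for each property."""
    winners = {}
    for selector, prop, value, important, order in declarations:
        # Tuples compare left to right: !important first, then
        # specificity, then order of declaration (later wins).
        rank = (important, specificity(selector), order)
        if prop not in winners or rank > winners[prop][0]:
            winners[prop] = (rank, value)
    return {prop: value for prop, (rank, value) in winners.items()}


styles = resolve(declarations)
print(styles)
# {'font-weight': 'normal', 'color': 'red', 'background': 'yellow'}
```

The result matches the walkthrough: red survives the more specific id rule only because of !important, while font-weight simply goes to the last declaration.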

2 Using !important is generally considered bad practice and usually an indication of poorly
structured CSS. Use with extreme caution as it can make life very difficult for co-developers.

JavaScript

JavaScript is the only first-class, browser-based programming
language. In order to do anything remotely advanced (and that
includes all modern web-visualisations) you should have a
JavaScript grounding. Other languages which claim to make
client-side/browser programming easier, such as TypeScript, CoffeeScript
etc., compile to JavaScript, which means debugging either uses
(generally flaky) mapping files or involves understanding the
automatically generated JavaScript. 99% of all web visualisation examples,
the ones you should aim to be learning from, are in JavaScript and
voguish alternatives have a way of fading with time. In essence, good
competence in, if not mastery of, JavaScript is a pre-requisite for
interesting web-visualisations.

The good news for Pythonistas reading is that JavaScript is actually
quite a nice language, once you’ve tamed a few of its more awkward
quirks [2]. As I showed in Chapter 2, JavaScript and Python have a lot
in common and it’s usually easy to translate from one to the other.

Data 

The data needed to fuel your web visualisation will be provided by
the web-server as static files (e.g. JSON or CSV files) or dynamically,
through some kind of web-API (e.g. RESTful APIs), usually
retrieving the data server-side from a database. We’ll be covering all
these forms in ???.

Although a lot of data used to be delivered in XML form, modern
web-visualisation is predominantly about JSON and, to a lesser
extent, CSV or TSV files.

JSON (short for JavaScript Object Notation) is the de-facto
web-visualisation data standard and I recommend you learn to love it. It
obviously plays very nicely with JavaScript but its structure will also
be familiar to Pythonistas. As we saw in “JSON” on page 83, reading
and writing JSON data with Python is a snip. Here’s a little example
of some JSON data:

{
  "firstName": "Groucho",
  "lastName": "Marx",
  "siblings": ["Harpo", "Chico", "Gummo", "Zeppo"],
  "nationality": "American",
  "yearOfBirth": 1890
}
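
As a quick reminder of how closely JSON maps onto Python’s own types, here is a short sketch that loads a record of this shape with the standard json module and inspects the result. It is written for Python 3, and the json_text and marx variable names are just illustrative.

```python
import json

json_text = """
{
  "firstName": "Groucho",
  "lastName": "Marx",
  "siblings": ["Harpo", "Chico", "Gummo", "Zeppo"],
  "nationality": "American",
  "yearOfBirth": 1890
}
"""

marx = json.loads(json_text)      # JSON object -> Python dict
print(marx['siblings'][0])        # Harpo
print(type(marx['yearOfBirth']))  # <class 'int'>

# Round trip: dict -> JSON string -> dict gives back equal data
assert json.loads(json.dumps(marx)) == marx
```

JSON objects become dicts, arrays become lists and numbers stay numbers, so there is very little translation to think about at the Python end.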

2 JavaScript’s quirks are succinctly discussed in Douglas Crockford’s famously short
JavaScript: The Good Parts.

Chrome’s Developer Tools

The arms-race in JavaScript engines in recent years, which has
produced huge increases in performance, has been matched by an
increasingly sophisticated range of development tools built in to the
various browsers. Firefox’s Firebug led the pack but for a while now
Chrome’s Developer Tools have surpassed it, adding functionality all
the time. There’s now a huge amount you can do with Chrome’s
tabbed tools but here I’ll introduce the two most useful tabs, the
HTML+CSS focused Elements and the JavaScript focused Sources. Both of
these work in complement to Chrome’s developer console,
demonstrated in “JavaScript” on page 36.


The Elements Tab 

To access the Elements tab, select More Tools > Developer Tools from 
the right-hand options menu or use the Ctrl-Shift-I keyboard short- 
cut. 


Figure 4-6 shows the Elements tab at work. You can select DOM
elements on the page by using the left-hand magnifying glass and
see their HTML-branch in the left panel. The right panel allows you
to see CSS styles applied to the element and look at any
event-listeners attached or DOM-properties.


Figure 4-6. Chrome Developer Tools Elements tab

One really cool feature of the Elements tab is that you can
interactively change element styling, both CSS styles and attributes².
This is a great way to refine the look and feel of your data
visualisations.

Chrome's Elements tab provides a great way to explore the structure
of a page, finding out how the different elements are positioned.
This is a good way to get your head around positioning content blocks
with the position and float properties. Seeing how the pros apply
CSS-styles is a really good way to up your game and learn some
useful tricks.

The Sources Tab

The Sources tab allows you to see any JavaScript included in the
page. Figure 4-7 shows the tab at work. In the left-panel you can
select a script or an HTML-file with embedded <script> tagged
JavaScript. As shown, you can place a breakpoint in the code, load
the page and, on break, see the call-stack, any scoped or global
variables etc. These breakpoints are parametric so you can set
conditions for them to trigger, handy if you want to catch and step
through a particular configuration. On break you have the standard
options to step in, out and over functions etc.


2 Being able to play with attributes is particularly useful when trying to get Scalable
Vector Graphics (SVG) to work.


Figure 4-7. Chrome Developer Tools Sources tab


The Sources tab is a fantastic resource and is the main reason why I
hardly ever turn to console logging when trying to debug JavaScript.
In fact, where JS debugging was once a hit-and-miss black art, it is
now almost a pleasure.

Other Tools 

There's a huge amount of functionality in those Chrome Developer
Tools tabs and it’s being updated almost daily. You can do memory
and CPU timelines and profiling, monitor your network downloads,
test out your pages for different form-factors etc. But you'll spend
99% of your time as a data visualiser in the Elements and Sources
tabs.


A Basic Page with Placeholders 

Now that we have covered the major elements of a web-page, let's
put them together. Most web-visualisations start off as HTML and
CSS skeletons, with placeholder elements ready to be fleshed out
with a little JavaScript plus data (see “Single-page Apps” on page
106).

We’ll first need our HTML skeleton, using the code in Example 4-1.
This consists of a tree of <div> content blocks, defining three
chart-elements: a header, main and sidebar section.

Example 4-1. The HTML skeleton 

    <!DOCTYPE html>
    <meta charset="utf-8">
    <link rel="stylesheet" href="style.css" type="text/css" />
    <body>
      <div id="chart-holder" class='dev'>
        <div id="header">
          <h2>A Catchy Title Coming Soon...</h2>
          <p>Some body-text describing what this visualisation is all
          about and why you should care.</p>
        </div>
        <div id="chart-components">
          <div id="main">
            A placeholder for the main chart.
          </div><div id="sidebar">
            <p>Some useful information about the chart,
            probably changing with user interaction...</p>
          </div>
        </div>
      </div>
      <script src="script.js"></script>
    </body>

Now we have our HTML skeleton, we want to style it using some
CSS. This will use the classes and ids of our content-blocks to adjust
size, position, background color etc. To apply our CSS, in
Example 4-1 we import a style.css file, shown in Example 4-2.

Example 4-2. CSS styling 

    body {
        background: #ccc;
        font-family: sans-serif;
    }

    div.dev {                    /* ❶ */
        border: solid 1px red;
    }

    div.dev div {
        border: dashed 1px green;
    }

    div#chart-holder {
        width: 600px;
        background: white;
        margin: auto;
        font-size: 16px;
    }

    div#chart-components {
        height: 400px;
        position: relative;      /* ❷ */
    }

    div#main, div#sidebar {
        position: absolute;      /* ❸ */
    }

    div#main {
        width: 75%;
        height: 100%;
        background: #eee;
    }

    div#sidebar {
        right: 0;                /* ❹ */
        width: 25%;
        height: 100%;
    }

❶ This dev class is a handy way to see the border of any visual
blocks, useful for visualisation work.

❷ Makes chart-components the relative parent.

❸ Makes the main and sidebar positions relative to
chart-components.

❹ Positions this block flush with the right wall of
chart-components.

We use absolute positioning of the main and sidebar chart elements
(Example 4-2). There are various ways to position the content-blocks
with CSS but absolute positioning gives you explicit control over
their placement, a must if you want to get the look just right.

After specifying the size of the chart-components container, the
main and sidebar child elements are sized and positioned using
percentages of their parent. This means any changes to the size of
chart-components will be reflected in its children.

With our HTML and CSS defined we can examine the skeleton by
firing up Python's single-line SimpleHTTPServer in the project
directory, like so:

$ python -m SimpleHTTPServer 
Serving HTTP on 0.0.0.0 port 8000 ... 
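
SimpleHTTPServer is the Python 2 name for this module. On Python 3 the equivalent one-liner is python -m http.server, and the same machinery can be driven from a script. The sketch below is one hedged way to do that: the serve_directory helper and its arguments are my own invention for illustration, and port 0 simply asks the operating system for any free port.

```python
import functools
import http.client
import http.server
import threading


def serve_directory(directory='.', port=0):
    """Start a background HTTP server for `directory`.

    Hypothetical helper: returns the running server so the
    caller can read its port and shut it down later.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(('127.0.0.1', port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


server = serve_directory('.')            # Serve the current directory
port = server.server_address[1]          # The port actually chosen

# Fetch the root URL once to check the server is answering.
conn = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
conn.request('GET', '/')
response = conn.getresponse()
status = response.status
response.read()
conn.close()
print('Serving on port', port, 'with status', status)

server.shutdown()
server.server_close()
```

For day-to-day work the command-line form is all you need; the scripted version is only worth reaching for if you want the server to start and stop alongside other tooling.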

Figure 4-8 shows the resulting page with the Elements tab open,
displaying the page's DOM-tree.



Figure 4-8. Building a Basic Webpage 

The chart's content-blocks are now positioned and sized correctly,
ready for JavaScript to add some engaging content.

Filling the placeholders with content 

With our content blocks defined in HTML and positioned using
CSS, a modern data visualisation uses JavaScript to construct its
interactive charts, menus, tables and the like. There are many ways
to create visual content (aside from image or multimedia tags) on
your modern browser, the main ones being:

• Scalable Vector Graphics (SVG) using special HTML-tags. 

• Drawing to a 2D canvas context. 

• Drawing to a 3D canvas WebGL context, allowing a subset of 
OpenGL commands. 

• Using modern CSS to create animations, graphic primitives etc.

Because SVG is the language of choice for D3, by some way the
biggest JavaScript dataviz library, many of the cool web
data-visualisations you have seen, such as those by the New York Times,
are built in that. Broadly speaking, unless you anticipate having lots
(>1000) of moving elements in your visualisation or need to use a
specific canvas-based library, SVG is probably the way to go.

By using vectors instead of pixels to express its primitives SVG will
generally produce ‘cleaner’ graphics that respond smoothly to scaling
operations. It's also much better at handling text, a crucial
consideration for many visualisations. Another key advantage of SVG is
that user interaction (e.g. mouse hovering or clicking) is native to
the browser, being part of the standard DOM event handling². A
final point in its favour is that because the graphic components are
built on the DOM, you can inspect and adapt them using your
browser's development tools (see “Chrome's Developer Tools” on
page 119). This can make debugging and refining your visualisations
much easier than trying to find errors in the canvas’s relatively
black box.

canvas graphics contexts come into their own when you need to 
move beyond simple graphic primitives like circles and lines, for 
example incorporating images, such as pngs or jpegs. canvas is usu- 
ally considerably more performant than SVG so anything with lots 
of moving elements³ is better off rendered to a canvas. If you want to
be really ambitious or move beyond 2D graphics, you can even 
unleash the awesome power of modern graphics cards by using a 
special form of canvas context, the OpenGL-based “webgl” context. 


2 With a canvas graphic context you generally have to contrive your own event handling. 

3 What this number is changes with time and the browser in question but as a rough rule 
of thumb the low thousands is where SVG often starts to strain. 

Just bear in mind that what would be simple user interaction with
SVG (e.g. clicking on a visual element) often has to be derived from
mouse-coordinates manually etc., adding a tricky layer of complexity.

The Nobel-prize data-visualisation realised at the end of this book's
tool-chain is built primarily with D3 so SVG graphics are the focus of
this book. Being comfortable with SVG is fundamental to modern
web-based dataviz so let's take a little primer.

Scalable Vector Graphics (SVG) 

It doesn't seem long ago that Scalable Vector Graphics (SVG)
seemed all washed up. Browser coverage was spotty and few big
libraries were using it. It seemed inevitable that the canvas tag
would act as a gateway to full-fledged, rendered graphics based on
leveraging the awesome power of modern graphics cards. Pixels not
vectors would be the building block of web-graphics and SVG
would go down in history as a valiant but ultimately doomed ‘nice
idea’.

D3 might not single-handedly have rescued SVG in the browser but
it must take the lion's share of responsibility. By demonstrating what
can be done by using data to manipulate or drive the web-page's
DOM, it provided a compelling use-case for SVG. D3 really needs its
graphic primitives to be part of the document hierarchy, in the same
domain as the other HTML content. In this sense it needed SVG as
much as SVG needed it.

The svg element 

All SVG creations start with an <svg> root tag. All graphical ele- 
ments such as circles, lines etc. and groups thereof are defined on 
this branch of the DOM-tree. Example 4-3 shows a little SVG con- 
text we’ll use in upcoming demonstrations, a light-gray rectangle 
with id chart. We also include the D3 library, loaded from d3js.org 
and a script.js JavaScript file in the project folder.

Example 4-3. A basic SVG context

    <!DOCTYPE html>
    <meta charset="utf-8">
    <!-- A few CSS style-rules -->
    <style>
      svg#chart {
        background: lightgray;
      }
    </style>

    <svg id='chart' width="300" height="225">
    </svg>

    <!-- Third-party libraries and our JS script. -->
    <script src="http://d3js.org/d3.v3.min.js"></script>
    <script src="script.js"></script>


Now we've got our little SVG canvas in place, let’s start doing some
drawing. 

The g element

We can group shapes within our svg element by using the group g 
element. As we’ll see in “Working with groups” on page 137, shapes 
contained in a group can be manipulated together, e.g. changing 
their position, scale or opacity. 

Circles 

Creating SVG visualisations, from the humblest little static
bar-chart to full-fledged interactive, geographic masterpieces, involves
putting together elements from a fairly small set of graphical
primitives such as lines, circles and the very powerful paths. Each of
these elements will have its own DOM-tag which will update as it
changes². E.g. its x and y attributes will change to reflect any
translations within its svg or group (g) context.

Let's add a circle to our SVG context to demonstrate:

    <svg id='chart' width="300" height="225">
      <circle r="15" cx="100" cy="50"></circle>
    </svg>

This produces Figure 4-9. Note that the y-coordinate is measured
from the top of the svg #chart container, a common graphic
convention.
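
Because SVG's y-axis points down, data values measured upwards usually need flipping before they are used as cy or y attributes. Here is a minimal sketch of that conversion, assuming the 225-pixel-high chart used here. The to_svg_y helper is hypothetical, not part of any library.

```python
CHART_HEIGHT = 225  # Height of the svg#chart element


def to_svg_y(y_from_bottom, height=CHART_HEIGHT):
    """Convert a y measured up from the bottom of the chart
    into SVG's convention, measured down from the top."""
    return height - y_from_bottom


# The circle at cy="50" sits 50 pixels below the top edge,
# which is 175 pixels above the bottom edge.
print(to_svg_y(175))   # 50
print(to_svg_y(0))     # 225, i.e. the bottom edge
```

Libraries such as D3 typically bury this flip inside their scale functions, but it helps to know it is happening when a chart comes out upside-down.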


2 You should be able to use your browser's development tools to see the tag attributes
updating in real time.


Figure 4-9. An SVG circle


Now let’s see how we go about applying styles to SVG elements. 

Applying CSS-styles 

The circle in Figure 4-9 is fill-colored lightblue using CSS styling 
rules: 

    #chart circle { fill: lightblue }

In modern browsers, most visual SVG styles can be set using CSS,
including fill, stroke, stroke-width and opacity. So if we wanted a
thick, semi-transparent green line (with id total) we could use the
following CSS:

#chart line#total { 
stroke: green; 
stroke-width: 3px; 
opacity: 0.5; 

} 

You can also set the styles as attributes of the tags, though CSS is
generally preferable.

    <circle r="15" cx="100" cy="50" fill="lightblue"></circle>

Which SVG features can be set by CSS and
which can’t is a source of some confusion and
plenty of gotchas. The SVG spec distinguishes
between element properties and attributes, the
former being more likely to be found among the
valid CSS styles. You can investigate the valid
CSS properties using Chrome's Elements tab and
its auto-complete. Also, be prepared for some
surprises. For example, SVG text is colored
using the fill not color property.

For fill and stroke, there are various color conventions you can use: 

• named HTML colors, such as lightblue 

• using HTML hex-codes (#RRGGBB), e.g. white #FFFFFF 

• RGB values, e.g. red = rgb(255, 0, 0) 

• RGBA values, where A is an alpha-channel (0-1), e.g.
half-transparent blue rgba(0, 0, 255, 0.5)

As well as adjusting the color's alpha-channel with RGBA, the SVG
elements can be faded using their opacity property. Opacity is used
a lot in D3 animations.

Stroke-width is measured in pixels by default but can use points etc.
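
The hex, RGB and RGBA conventions are just different spellings of the same three or four numbers. A small Python sketch of the conversion follows; the helper names hex_to_rgb and rgba_string are my own, chosen for illustration.

```python
def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of 0-255 integers."""
    digits = hex_code.lstrip('#')
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgba_string(hex_code, alpha):
    """Build a CSS rgba(...) string from a hex-code and a 0-1 alpha."""
    r, g, b = hex_to_rgb(hex_code)
    return 'rgba(%d, %d, %d, %s)' % (r, g, b, alpha)


print(hex_to_rgb('#FFFFFF'))          # (255, 255, 255)
print(rgba_string('#0000FF', 0.5))    # rgba(0, 0, 255, 0.5)
```

This sort of helper is handy on the Python side of the toolchain when colors are generated from data and shipped to the browser as strings.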

Lines, rectangles, polygons 

We'll add a few more elements to our chart to produce Figure 4-10.

First we’ll add a couple of simple axis-lines to our chart, using the
<line> tag. Line positions are defined by a start coordinate (x1, y1)
and an end one (x2, y2):

    <line x1="20" y1="20" x2="20" y2="130"></line>
    <line x1="20" y1="130" x2="280" y2="130"></line>

We'll also add a dummy legend-box in the top-right corner using an
SVG rectangle. Rectangles are defined by x and y coordinates
relative to their parent container and a width and height:

    <rect x="240" y="5" width="55" height="30"></rect>

You can create irregular polygons using the <polygon> tag, which
takes a list of coordinate pairs. Let's make a triangle marker in the
bottom right of our chart:

    <polygon points="210,100, 230,100, 220,80"></polygon>

We’ll style the elements with a little CSS:

    #chart circle {fill: lightblue}
    #chart line {stroke: #555555; stroke-width: 2}
    #chart rect {stroke: red; fill: white}
    #chart polygon {fill: green}


Figure 4-10. Adding a few elements to our dummy-chart
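
Since SVG is just a tree of tags, any XML tool can write it. As a hedged illustration, the sketch below rebuilds the dummy-chart's shapes with Python's standard-library ElementTree. The shapes list is only a convenient way of holding the tag names and attributes and is not a convention from any dataviz library.

```python
import xml.etree.ElementTree as ET

svg = ET.Element('svg', id='chart', width='300', height='225')

# (tag, attributes) pairs mirroring the hand-written markup.
shapes = [
    ('circle',  {'r': '15', 'cx': '100', 'cy': '50'}),
    ('line',    {'x1': '20', 'y1': '20', 'x2': '20', 'y2': '130'}),
    ('line',    {'x1': '20', 'y1': '130', 'x2': '280', 'y2': '130'}),
    ('rect',    {'x': '240', 'y': '5', 'width': '55', 'height': '30'}),
    ('polygon', {'points': '210,100, 230,100, 220,80'}),
]
for tag, attrs in shapes:
    ET.SubElement(svg, tag, attrs)

markup = ET.tostring(svg, encoding='unicode')
print(markup)
```

The printed markup can be pasted into an HTML page in place of the hand-coded svg element and styled with the same CSS rules.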

Now we've a few graphical primitives in place, let's see how we add
some text to our dummy-chart.

Text 

One of the key strengths of SVG over the rasterized canvas context
is how it handles text. Vector-based text tends to look a lot clearer
than its pixellated counterparts and benefits from smooth scaling
too. You can also adjust stroke and fill properties, just like any SVG
element.

Let's add a bit of text to our dummy-chart, a title and labelled y-axis
(see Figure 4-11).

Text is placed using x and y coordinates. One important property is
the text-anchor which stipulates where the text is placed relative to
its x position. The options are start, middle and end, start being the
default.

We can use the text-anchor property to center our chart title. We
set the x coordinate at half the chart width and then set the
text-anchor to middle.

    <text id="title" text-anchor="middle" x="150" y="20">
      A Dummy Chart
    </text>

Like all SVG primitives, we can apply scaling and rotation trans- 
forms to our text. To label our y-axis we’ll need to rotate the text to 
the vertical (Example 4-4). By convention, rotations are clockwise by 
degree so we’ll want an anti-clockwise, -90deg. rotation. By default 
rotations are about the (0,0) point of the element's container (svg or
group g). We want to rotate our text about its own position so first 
translate the rotation point using the extra arguments to the rotate 
function. We also want to first set the text-anchor to the end of the 
‘y axis label’ string to rotate about its end point. 

Example 4-4. Rotating text 

    <text x="20" y="20" transform="rotate(-90,20,20)"
          text-anchor="end" dy="0.71em">y axis label</text>

In Example 4-4 we make use of the text's dy attribute which, along
with dx, can be used to make fine adjustments to the text's position.
In this case we want to lower it so that when rotated anti-clockwise
it will be to the right of the y-axis.
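
If the effect of the extra arguments to rotate is hard to picture, the arithmetic is easy to check by hand. The sketch below applies the standard rotation formula about a pivot point, in SVG's y-down coordinates. The rotate_point function is my own illustration, not an SVG or D3 call.

```python
import math


def rotate_point(x, y, degrees, cx=0.0, cy=0.0):
    """Apply SVG's rotate(degrees, cx, cy) to the point (x, y).

    SVG's y-axis points down, so positive angles appear
    clockwise on screen.
    """
    a = math.radians(degrees)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


# A point 10 pixels to the right of the pivot (20, 20) ends up
# 10 pixels above it after a -90 degree rotation.
x, y = rotate_point(30, 20, -90, 20, 20)
print(round(x, 6), round(y, 6))     # 20.0 10.0

# With the default pivot of (0, 0) the same point is swung
# right off the top of the chart.
x0, y0 = rotate_point(30, 20, -90)
print(round(x0, 6), round(y0, 6))   # 20.0 -30.0
```

The second result shows why the pivot matters: rotating a label about the container's origin rather than its own position can easily move it out of view.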

SVG text elements can also be styled using CSS. Here we set the
font-family of the chart to sans-serif and the font-size to 16px, 
using the title id to make that a little bigger: 

#chart { 

background: #eee; 
font-family: sans-serif; 

} 

#chart text{ font-size: 16px } 

#chart text#title{ font-size: 18px } 

Figure 4-11. Some SVG text

Note that the text elements inherit font-family and font-size from
the chart's CSS; you don't have to specify a text element.

Paths 

Paths are the most complicated and powerful SVG element, enabling 
the creation of multi-line, multi-curve component paths which can 
be closed and filled, creating pretty much any shape you want. A 
simple example is adding a little chart-line to our dummy-chart to 
give Figure 4-12. 

The red path in Figure 4-12 is produced by the following SVG:

<path d="M20,130L60,70L110,100L160,45"></path> 

The path’s d attribute specifies the series of operations needed to
make the red-line. Let's break it down:

1. “M20,130” - move to coordinate (20, 130)

2. “L60,70” - draw a line to (60, 70)

3. “L110,100” - draw a line to (110, 100)

4. “L160,45” - draw a line to (160, 45)

You can imagine d as a set of instructions to a pen to move to a
point, with M raising the pen from the canvas.
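
In practice path strings are nearly always generated from data rather than typed. A minimal sketch of how the moveto and lineto commands can be assembled in Python follows; line_path is a hypothetical helper, and D3 has its own line generators that do this job in the browser.

```python
def line_path(points):
    """Build a path `d` string from a list of (x, y) tuples:
    one moveto (M) followed by a lineto (L) per remaining point."""
    commands = []
    for i, (x, y) in enumerate(points):
        letter = 'M' if i == 0 else 'L'
        commands.append('%s%s,%s' % (letter, x, y))
    return ''.join(commands)


d = line_path([(20, 130), (60, 70), (110, 100), (160, 45)])
print(d)   # M20,130L60,70L110,100L160,45
```

The output matches the d attribute of the red chart-line, so the same helper could feed a chart with whatever points your data supplies.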

A little CSS-styling is needed. Note that the fill is set to none;
otherwise, to create a fill-area, the path would be closed, drawing a
line from its end to beginning points, and any enclosed areas filled
in with the default color black:

    #chart path {stroke: red; fill: none}



Figure 4-12. A red line-path from the chart axis 

As well as the moveto ‘M’ and lineto ‘L’, the path has a number of
other commands to draw arcs, Bezier curves etc. SVG arcs and
curves are commonly used in dataviz work, with many of D3’s
libraries making use of them². Figure 4-13 shows some SVG
elliptical-arcs created by the following code:

    <svg id='chart' width="300" height="150">
      <path d="M40,40
               A30,40 ❶
               0,0,1 ❷
               80,80
               A50,50 0,0,1 160,80
               A30,30 0,0,1 190,80
      "/>
    </svg>

❶ Having moved to position (40, 40), draw an elliptical-arc with
x-radius 30, y-radius 40 and end-point (80, 80).

❷ The last two flags (0, 1) are large-arc-flag, specifying which
arc of the ellipse to use, and sweep-flag, which specifies which
of the two possible ellipses defined by start and end-points to
use.

2 This chord-diagram is a nice example, using D3's chord function.



Figure 4-13. Some SVG Elliptical-arcs 

The key flags used in the elliptical arc, large-arc-flag and sweep- 
flag are, like most things geometric, better demonstrated than 
described. Figure 4-14 shows the effect of changing the flags for the 
same relative beginning and end points, like so: 

    <svg id='chart' width="300" height="150">
      <path d="M40,80
               A30,40 0,0,1 80,80
               A30,40 0,0,0 120,80
               A30,40 0,1,0 160,80
               A30,40 0,1,1 200,80
      "/>
    </svg>
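
If you want to experiment with the flags, generating the arc commands programmatically saves a lot of fiddly typing. The following Python is a small sketch under my own naming: arc_to and the flag_pairs list are illustrative, while the argument order follows the SVG arc command (radii, x-axis rotation, large-arc-flag, sweep-flag, end-point).

```python
def arc_to(rx, ry, end_x, end_y, large_arc=0, sweep=0, rotation=0):
    """Format one elliptical-arc (A) path command."""
    return 'A%s,%s %s,%s,%s %s,%s' % (
        rx, ry, rotation, large_arc, sweep, end_x, end_y)


# The four (large-arc-flag, sweep-flag) combinations, in the
# order used by the path above.
flag_pairs = [(0, 1), (0, 0), (1, 0), (1, 1)]

x = 40
segments = ['M%s,80' % x]
for large_arc, sweep in flag_pairs:
    x += 40   # Each arc ends 40 pixels further right
    segments.append(arc_to(30, 40, x, 80, large_arc, sweep))

d = ' '.join(segments)
print(d)
```

Changing the radii or the flag order and reloading the page is a quick way to build an intuition for which of the four possible arcs each combination selects.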

As well as lines and arcs, the path element offers a number of Bezier
curves, quadratic, cubic and compounds of the two. With a little
work these can realise any line-path you want. There's a nice
run-through here with good illustrations.

For the definitive list of path elements and their arguments go here
to the w3 source. And for a nice round-up see Jakob Jenkov's
introduction.

Scaling and rotating 

As befits their vector nature, all SVG elements can be transformed
by geometric operations. The most commonly used are rotate,
translate and scale but you can also apply skewing using skewX and skewY
or use the powerful, multi-purpose matrix transform.

Let's demonstrate the most popular transforms, using a set of
identical rectangles. The transformed rectangles in Figure 4-15 are
achieved like so:

    <svg id='chart' width="300" height="150">
      <rect width="20" height="40" transform="translate(60, 55)"
            fill='blue'/>
      <rect width="20" height="40" transform="translate(120, 55),
            rotate(45)" fill='blue'/>
      <rect width="20" height="40" transform="translate(180, 55),
            scale(0.5)" fill='blue'/>
      <rect width="20" height="40" transform="translate(240, 55),
            rotate(45),scale(0.5)" fill='blue'/>
    </svg>


Figure 4-15. Some SVG transforms: rotate(45), scale(0.5), scale(0.5) +
rotate(45)



The order in which transforms are applied is
important. A rotation of 45 deg. clockwise
followed by a translation along the x-axis will see
the element moved south-easterly whereas the
reverse operation moves it along the x-axis and
then rotates it in place.
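
The effect of ordering can be checked numerically. Each SVG transform is equivalent to a 3x3 matrix and a transform list is the product of those matrices, read left to right. The sketch below uses plain Python lists rather than a maths library, and all the function names are my own.

```python
import math


def translate(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


def rotate(degrees):
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def multiply(m, n):
    """3x3 matrix product m * n."""
    return [[sum(m[i][k] * n[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]


def apply(matrix, x, y):
    """Transform the point (x, y), returning rounded coordinates."""
    px = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    py = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return round(px, 1), round(py, 1)


# Equivalent to transform="rotate(45) translate(100, 0)"
rotate_first = multiply(rotate(45), translate(100, 0))
# Equivalent to transform="translate(100, 0) rotate(45)"
translate_first = multiply(translate(100, 0), rotate(45))

# Where does the element's own origin (0, 0) end up?
print(apply(rotate_first, 0, 0))      # (70.7, 70.7): south-east
print(apply(translate_first, 0, 0))   # (100.0, 0.0): along the x-axis
```

In the first case the translation happens inside the rotated coordinate system, which is why the element drifts diagonally. In the second the element is moved first and then spun about its new position.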


Working with groups 

Often when constructing a visualisation it's helpful to group the
visual elements. A couple of particular uses are:

• When you require local coordinate schemes, e.g. if you have a 
text label for an icon you want to specify its position relative to 
the icon not the whole svg canvas. 

• If you want to apply a scaling and/or rotation transformation to 
a sub-set of the visual elements. 

SVG has a group <g> tag for this which you can think of as a mini
canvas within the svg canvas. Groups can contain groups, allowing
for very flexible geometric mappings².

Example 4-5 groups some shapes in the center of the canvas, pro- 
ducing Figure 4-16. Note that the position of circle, rect and path 
elements is relative to the translated group. 

Example 4-5. Grouping some SVG shapes

    <svg id='chart' width='300' height='150'>
      <g id='shapes' transform='translate(150,75)'>
        <circle cx='50' cy='0' r='25' fill='red' />
        <rect x='30' y='10' width='40' height='20' fill='blue' />
        <path d='M-20,-10L50,-10L10,60Z' fill='green' />
        <circle r='10' fill='yellow' />
      </g>
    </svg>



Figure 4-16. Grouping some shapes with SVG <g> tag

If we now apply a transform to the group, all shapes within it will be
affected. Figure 4-17 shows the result of scaling Figure 4-16 by a
factor of 0.75 and then rotating it 90 degrees, achieved by adapting
the transform attribute, like so:


2 For example, a body group can contain an arm group can contain a hand group can 
contain finger elements. 

    <svg id='chart' width="300" height="150">
      <g id='shapes'
         transform='translate(150,75),scale(0.5),rotate(90)'>
        ...
      </g>
    </svg>

Figure 4-17. Transforming an SVG group

Layering and transparency 

The order in which the SVG elements are added to the DOM-tree is 
important, with later elements taking precedence, layering over oth- 
ers. In Figure 4-16, for example, the triangle path obscures the red 
circle and blue rectangle and is in turn obscured by the yellow circle. 

Manipulating the DOM ordering is an important part of JavaScrip- 
ted dataviz, e.g. D3’s insert method allows you to place an SVG ele- 
ment before an existing one. 

Element transparency can be manipulated using the alpha-channel
of rgba(R,G,B,A) colors or the more convenient opacity property.
Both can be set using CSS. For overlaid elements, opacity is
cumulative, as demonstrated by the color triangle in Figure 4-18,
produced by the following SVG:

    <style>
      #chart circle { opacity: 0.33 }
    </style>

    <svg id='chart' width="300" height="150">
      <g transform='translate(150, 75)'>
        <circle cx='0' cy='-20' r='30' fill='red'/>
        <circle cx='17.3' cy='10' r='30' fill='green'/>
        <circle cx='-17.3' cy='10' r='30' fill='blue'/>
      </g>
    </svg>



Figure 4-18. Manipulating opacity with SVG
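
To put a number on that cumulative effect, here is a small worked calculation. It assumes the usual 'source-over' compositing, where each layer lets through a fraction (1 - opacity) of what lies beneath it. The combined_opacity function is my own shorthand and only describes coverage, not how the three colors mix.

```python
def combined_opacity(opacity, layers):
    """Effective opacity where `layers` elements, each with the
    same opacity, overlap: one minus the fraction of the
    background that shows through every layer."""
    return 1 - (1 - opacity) ** layers


for n in (1, 2, 3):
    print(n, round(combined_opacity(0.33, n), 3))
# 1 0.33
# 2 0.551
# 3 0.699
```

So where all three circles overlap, roughly 70% of the background is hidden, which is why the centre of the triangle looks noticeably denser than the single-circle regions.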


The SVG elements demonstrated above were hand-coded in HTML 
but in data-visualisation work they are almost always added pro- 
grammatically. Thus the basic D3 workflow is to add SVG elements 
to a visualisation, using data-files to specify their attributes and 
properties. 

JavaScripted SVG 

The fact that SVG graphics are described by DOM-tags has a num- 
ber of advantages over a black-box such as the <canvas> context. 
For example, it allows non-programmers to create or adapt graphics
and is a boon for debugging. 

In web dataviz pretty much all your SVG elements will be created
with JavaScript, using a library such as D3. You can inspect the
results of this scripting using the browser's Elements tab (“Chrome's
Developer Tools” on page 119), which is a great way to refine and
debug your work, e.g. nailing an annoying visual glitch.

As a little taster for things to come, let's use D3 to scatter a few red
circles on an SVG canvas. The dimensions of the canvas and circles
are contained in a data object sent to a chartCircles function.

We use a little HTML place-holder for the SVG element:

    <!DOCTYPE html>
    <meta charset="utf-8">
    <style>
      #chart circle {fill: red}
    </style>
    <body>
      <svg id='chart'></svg>
      <script src="http://d3js.org/d3.v3.min.js"></script>
      <script src="script.js"></script>
    </body>

With our place-holder SVG chart element in place, a little D3 in the 
script.js file is used to turn some data into the scattered circles
(see Figure 4-19): 

    // script.js
    var chartCircles = function(data) {

        var chart = d3.select('#chart');
        // Set the chart height and width from data
        chart.attr('height', data.height).attr('width', data.width);
        // Create some circles using the data
        chart.selectAll('circle').data(data.circles)
            .enter()
            .append('circle')
            .attr('cx', function(d) { return d.x })
            .attr('cy', function(d) { return d.y })
            .attr('r', function(d) { return d.r });
    };

    var data = {
        width: 300, height: 150,
        circles: [
            {'x': 50, 'y': 30, 'r': 20},
            {'x': 70, 'y': 80, 'r': 10},
            {'x': 160, 'y': 60, 'r': 10},
            {'x': 200, 'y': 100, 'r': 5},
        ]
    };

    chartCircles(data);

Figure 4-19. Some D3-generated circles
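
The data-driven idea is not specific to D3. The Python sketch below is emphatically not how D3 works internally. It simply loops over the same data object and emits one circle tag per item, which is the end result of the enter and append chain. The chart_circles name mirrors the JavaScript function but is otherwise my own.

```python
import xml.etree.ElementTree as ET

data = {
    'width': 300, 'height': 150,
    'circles': [
        {'x': 50, 'y': 30, 'r': 20},
        {'x': 70, 'y': 80, 'r': 10},
        {'x': 160, 'y': 60, 'r': 10},
        {'x': 200, 'y': 100, 'r': 5},
    ]
}


def chart_circles(data):
    """Return SVG markup with one <circle> per data item."""
    chart = ET.Element('svg', id='chart',
                       width=str(data['width']),
                       height=str(data['height']))
    for d in data['circles']:
        ET.SubElement(chart, 'circle',
                      cx=str(d['x']), cy=str(d['y']), r=str(d['r']))
    return ET.tostring(chart, encoding='unicode')


markup = chart_circles(data)
print(markup)
```

What D3 adds on top of this simple loop is the ability to keep the tags and the data in step as the data changes, which a one-off string generator cannot do.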


We’ll see exactly how D3 works its magic in ???. For now, let's
summarise what we've learned in this chapter.

Summary 

This chapter provided a basic set of modern web-development skills 
for the budding data-visualiser. It showed how the various elements 
of a web-page (the HTML, CSS-stylesheets, JavaScript and media- 
files) are delivered by HTTP and, on being received by the browser, 
combined into the web-page the user sees. We saw how content 
blocks are described, using HTML tags such as div and p, and then 
styled and positioned using CSS. We also covered Chrome's
Elements and Sources tabs, the key browser development tools. Finally
we had a little primer in SVG, the language in which most modern
web data-visualisations are expressed. These skills will be extended
when our toolchain reaches its D3 visualisation and new ones
introduced in context.

PART II


Getting Your Data 


In this part of the book we start our journey along the dataviz toolchain
(see Figure II-1), beginning with a couple of chapters on how to get
your data if it hasn't been provided for you.

In Chapter 5 we see how to get data off the web, using Python's
requests library to grab web-based files and consume RESTful
APIs. We also see how to use a couple of Python libraries that wrap
more complex web-APIs, namely Twitter (with Python's Tweepy)
and Google docs. The chapter ends with an example of light-weight
web-scraping with the BeautifulSoup library.

In Chapter 6 we use Scrapy, Python's industrial-strength
web-scraper, to get the Nobel prize data-set we’ll be using for our
web-visualisation. With this dirty data-set to hand, we’re ready for the
next part of the book ???.




Figure II-1. Our dataviz toolchain: Getting the data









CHAPTER 5 


Getting Data off the Web with Python


A fundamental part of the data-visualiser's skill-set is getting the
right data-set, in as clean a form as possible. And more often than 
not these days, this involves getting it off the web. There are various 
ways you can do this and Python provides some great libraries 
which make sucking up the data easy. 

The main ways to get data off the web are: 

• Get a raw data-file over HTTP. 

• Use a dedicated API to get the data. 

• Scrape the data by getting web pages by HTTP and parsing 
them locally for content. 

This chapter will deal with these ways in turn, but first let's get
acquainted with the best Python HTTP library out there, requests.

Getting Web-data with the requests library 

As we saw in Chapter 4, the files that are used by web-browsers to
construct web-pages are communicated using the Hypertext Transfer
Protocol, HTTP, first developed by Tim Berners-Lee. Getting
web-content in order to parse it for data involves making HTTP
requests.

Negotiating HTTP requests is a vital part of any general purpose
language but getting web-pages with Python used to be a rather
irksome affair. The venerable urllib2 library was hardly user-friendly,
with a very clunky API. Requests, courtesy of Kenneth Reitz,
changed that, making HTTP a relative breeze and fast establishing
itself as the go-to Python HTTP library.

Requests is not part of the Python standard-library² but is part of
the Anaconda package (see Chapter 1). If you're not using Anaconda,
the following pip command should do the job:

    $ pip install requests
    Downloading/unpacking requests

    Cleaning up...

If you're using a Python version prior to 2.7.9 then using requests
may generate some Secure Sockets Layer (SSL) warnings. Upgrading
to newer SSL libraries should fix this³:

    $ pip install --upgrade ndg-httpsclient

Now that you have requests installed, you're ready to perform the
first task mentioned at the beginning of this chapter and grab some
raw data-files off the web.

Getting Data-files with requests 

A Python interpreter session is a good way to put requests through
its paces so find a friendly local command-line, fire up IPython and
import requests:

    $ ipython
    Python 2.7.5+ (default, Feb 27 2014, 19:37:08)

    In [1]: import requests

To demonstrate, let's use the library to download a Wikipedia page.
We use the requests library's get method to get the page and, by
convention, assign the result to a response object.

    response = requests.get("https://en.wikipedia.org/wiki/Nobel_Prize")

2 This is actually a deliberate policy of the developers.

3 There are some platform dependencies that might still generate errors. This
Stackoverflow thread is a good starting point if you still have problems.


Let’s use Python’s dir function to get a list of the response object’s
attributes:

dir(response)

Out:
[...
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'status_code',
 'text',
 'url']

Most of these attributes are self-explanatory and together provide a
lot of information about the HTTP response generated. You’ll generally
use only a small subset of them. Firstly, let’s check the status
of the response:

response.status_code
Out: 200

As all good minimal web-devs know, 200 is the HTTP status
code for OK, indicating a successful transaction. Other than 200, the
most common codes are:

• 401 (Unauthorized): attempting unauthorized access.

• 400 (Bad Request): trying to access the web-server incorrectly.

• 403 (Forbidden): similar to 401 but no login opportunity was
available.

• 404 (Not Found): trying to access a web-page that doesn’t exist.

• 500 (Internal Server Error): a general-purpose, catch-all error.

So, for example, if we made a spelling mistake with our request, asking
to see the SNobel_Prize page, we’d get a 404 (Not Found) error:

response = requests.get("http://en.wikipedia.org/wiki/SNobel_Prize")
response.status_code
Out: 404
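If you would rather not memorise the codes, Python’s standard library
already knows them. The following is a minimal sketch, written for
Python 3, and the describe_status helper is a name made up for this
illustration (it is not part of requests). It looks up the standard
reason phrase for a status code:

```python
from http import HTTPStatus


def describe_status(code):
    """Return e.g. '404 Not Found' for a known HTTP status code."""
    try:
        return '%d %s' % (code, HTTPStatus(code).phrase)
    except ValueError:
        # Not a status code the standard library knows about
        return '%d (unknown status)' % code


# The codes discussed in the list above
for code in (200, 400, 401, 403, 404, 500):
    print(describe_status(code))
```

In a live session you could pass response.status_code straight to a
helper like this when logging failed requests.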
With our 200 OK response, from the correctly-spelled request, let’s
look at some of the info returned. A quick overview can be had with
the headers property:

response.headers
Out: {
 'X-Client-IP': '104.238.169.128',
 'Content-Length': '65820', ...
 'Content-Encoding': 'gzip', ...
 'Last-Modified': 'Sun, 15 Nov 2015 17:14:09 GMT', ...
 'Date': 'Mon, 23 Nov 2015 21:33:52 GMT',
 'Content-Type': 'text/html; charset=UTF-8' ...
}

This shows, among other things, that the page returned was gzip
encoded and 65k in size with content-type of text/html, encoded
with Unicode UTF-8.
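The Content-Type header packs two facts into one string, the media
type and the character set. As a rough sketch of how they can be
pulled apart, here is a standard-library approach for Python 3. The
parse_content_type helper is an assumption made for illustration, not
something requests provides (requests reports its own reading of the
charset in the response’s encoding attribute):

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset)."""
    msg = Message()
    msg['Content-Type'] = header_value
    # Both values come back lower-cased; charset is None if absent
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type('text/html; charset=UTF-8')
print(media_type, charset)
```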

Since we know text has been returned, we can use the text property
of the response to see what it is:

response.text
Out: u'<!DOCTYPE html>\n<html lang="en"
dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8"
/>\n<title>Nobel Prize - Wikipedia, the free
encyclopedia</title>\n<script>document.documentElement.className ...

This shows we do indeed have our Wikipedia HTML page, with
some inline JavaScript. As we’ll see in “Scraping Data” on page 160,
in order to make sense of this content we’ll need a parser to read the
HTML and provide us with the content-blocks.

requests can be a convenient way of getting web-data into your
program or Python session. For example, we can grab one of the
datasets from the huge US government catalog, which often has the
choice of various file-formats, e.g. JSON or CSV. Picking randomly,
here’s the data from a 2006-2010 study on food affordability, in
JSON format. Note that we check that it has been fetched correctly,
with a status_code of 200:

response = requests.get(
    "https://cdph.data.ca.gov/api/views/6tej-5zx7/rows.json"
    "?accessType=DOWNLOAD")
response.status_code
Out: 200

For JSON data, requests has a convenience method, allowing us to
access the response data as a Python dictionary. This contains
meta-data and a list of data-items:

data = response.json()
data.keys()
Out:
[u'meta', u'data']

data['meta']['view']['description']
Out: u'This table contains data on the average cost of a
market basket of nutritious food items relative to income for
female-headed households with children, for California, its
regions, counties, and cities/towns. The ratio uses data from
the U.S. Department of Agriculture ...

data['data'][0]
Out:
[1,
 u'4303993D-76F7-4A5C-914E-FDEA4EAB67BA',
 u'Food affordability for female-headed household with
 children under 18 years',
 u'2006-2010',
 u'1',
 u'AIAN',
 u'CA',
 u'06',
 u'California', ...
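The shape of that dictionary is easy to explore offline. The sketch
below (Python 3) uses a tiny, made-up payload that only mimics the
meta/data layout shown above. The values are placeholders rather than
real records, and json.loads stands in for roughly what the json
convenience method does with the response body:

```python
import json

# A made-up miniature of the payload: metadata plus a list of row-lists
raw = '''
{"meta": {"view": {"description": "placeholder description"}},
 "data": [[1, "row-id-a", "2006-2010", "CA"],
          [2, "row-id-b", "2006-2010", "CA"]]}
'''

data = json.loads(raw)

print(sorted(data.keys()))                  # the two top-level keys
print(data['meta']['view']['description'])  # nested metadata lookup
print(len(data['data']), 'rows; first row:', data['data'][0])
```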

Now we’ve grabbed a raw page and a JSON file off the web, let’s see
how to use requests to consume a web data-API.

Using Python to Consume Data from a Web-API

If the data-file isn’t on the web and you are lucky, rather than having
to scrape some data configured for human consumption, there will
be an Application Programming Interface (API) that enables you to get
the data programmatically and hopefully in a form that is cleaner
and better organized (e.g. getting Twitter tweets from the official
Twitter API).

The most popular data formats for web-APIs are JSON and XML,
though a number of esoteric formats exist. For the purposes of the
JavaScripting data-visualiser (discussed in ???), JavaScript Object
Notation (JSON) is obviously preferred. Helpfully, it is also starting
to predominate.

There are different approaches to creating a web-API and for a few
years there was a little war of the architectures. Three main types of
API inhabit the web:

• REST: short for Representational State Transfer, using a
combination of HTTP verbs (GET, POST etc.) and Uniform Resource
Identifiers (URIs), e.g. /user/kyran, to access, create and adapt
data.

• XML-RPC: a remote procedure call (RPC) protocol using XML
encoding and HTTP transport.

• SOAP: short for Simple Object Access Protocol, using XML and
HTTP.

This battle seems to be resolving in a victory for RESTful APIs and
this is a very good thing. Quite apart from RESTful APIs being more
elegant, easier to use and implement (see ???), some standardization
here makes it much more likely that you will recognize and quickly
adapt to a new API that comes your way. Ideally you will be able to
reuse existing code.

Most access and manipulation of remote data can be summed up by
the acronym CRUD (create, retrieve, update, delete), originally
coined to describe all the major functions implemented in relational
databases. HTTP provides CRUD counterparts with the POST, GET,
PUT and DELETE verbs and the REST abstraction builds on this
use of these verbs, acting on a Uniform Resource Identifier (URI).

Discussions about what is and isn’t a proper RESTful interface can
get quite involved [2] but essentially the URI (e.g.
http://example.com/api/items/2) should contain all the information
required in order to perform a CRUD operation. The particular
operation (e.g. GET or DELETE) is specified by the HTTP verb.
This excludes architectures such as SOAP which place stateful
information in meta-data on the request’s header. Imagine the URI as the
virtual address of the data and CRUD as all the operations you can
perform on it.
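The pairing of CRUD operations with HTTP verbs described above can be
written down as a simple lookup. This is only an illustrative sketch:
the CRUD_TO_HTTP table and describe_request helper are made-up names,
and nothing is sent over the network.

```python
# CRUD operations and the HTTP verbs that usually carry them
CRUD_TO_HTTP = {
    'create': 'POST',
    'retrieve': 'GET',
    'update': 'PUT',
    'delete': 'DELETE',
}


def describe_request(operation, uri):
    """Return the 'VERB URI' pair a RESTful client would send."""
    return '%s %s' % (CRUD_TO_HTTP[operation], uri)


# The URI names the resource; the verb names the operation on it
ITEM_URI = 'http://example.com/api/items/2'
for op in ('retrieve', 'update', 'delete'):
    print(describe_request(op, ITEM_URI))
```

The URI stays the same in every line of output; only the verb changes,
which is the essence of the RESTful style.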


2 See Parkinson’s law of triviality, also known as bike-shedding.

As data visualizers, keen to lay our hands on some interesting
data-sets, we are avid consumers here, so our HTTP verb of choice is
GET and the examples below will focus on the fetching of data with
various well known web-APIs. Hopefully some patterns will emerge.

Although the two constraints of stateless URIs and the use of the
CRUD verbs nicely restrict the shape of RESTful APIs, there still
manage to be many variants on the theme.

Using a RESTful Web-API with requests

requests has a fair number of bells and whistles based around the
main HTTP request verbs. For a good overview see here. For the
purposes of getting data, you’ll use GET and POST pretty much
exclusively, with GET being by a long way the most used verb. POST
allows you to emulate web-forms, including login details, field-values
etc. in the request. For those occasions where you find yourself
driving a web-form with, for example, lots of options selectors,
requests makes automation with POST easy. GET covers pretty
much everything else, including the ubiquitous RESTful APIs, which
provide an increasing amount of the well-formed data available on
the web.

Let’s look at a more complicated use of requests, getting a URL
with arguments. The Organisation for Economic Co-operation and
Development (OECD) provides some useful datasets on its site. The
API is described here and queries are constructed using the dataset
name (dsname) and some dot-separated dimensions, each of which can
be a number of + separated values. The URL can also take standard
HTTP parameters initiated by a ? and separated by &s:

<root_url>/<dsname>/<dim 1>.<dim 2>...<dim n>/all?param1=foo&param2=baa..
<dim 1> = 'AUS'+'AUT'+'BEL'...

So the following is a valid URL:

http://stats.oecd.org/sdmx-json/data/QNA ❶
/AUS+AUT.GDP+B1_GE.CUR+VOBARSA.Q ❷
/all?startTime=2009-Q2&endTime=2011-Q4 ❸

❶ Specifies the QNA dataset.

❷ Four dimensions, by location, subject, measure and frequency.

❸ Data from the second quarter 2009 to fourth quarter 2011.

Let’s construct a little Python function to query the OECD’s API:

Example 5-1. Making a URL for the OECD API

import requests

OECD_ROOT_URL = 'http://stats.oecd.org/sdmx-json/data'

def make_OECD_request(dsname, dimensions, params=None,
                      root_dir=OECD_ROOT_URL):
    """ Make a URL for the OECD-API and return a response """

    if not params:  # ❶
        params = {}

    dim_str = '.'.join('+'.join(d) for d in dimensions)  # ❷
    url = root_dir + '/' + dsname + '/' + dim_str + '/all'
    print('Requesting URL: ' + url)
    return requests.get(url, params=params)  # ❸

❶ You shouldn’t use mutable values, such as {}, for Python function
defaults. See here for an explanation of this gotcha.

❷ Using a generator expression with Python’s succinct string
concatenator join.

❸ Note that requests’ get can take a parameter dictionary as its
second argument, using it to make the URL-query string.
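The mutable-default gotcha mentioned in the first callout is easy to
reproduce. The two helper functions below are made up for this
demonstration: the first shares one dictionary between all calls that
omit params, the second uses the None idiom to create a fresh one each
time.

```python
def add_param_buggy(key, value, params={}):
    """The default dict is created once and shared between calls."""
    params[key] = value
    return params


def add_param_safe(key, value, params=None):
    """A fresh dict is created on every call that omits params."""
    if params is None:
        params = {}
    params[key] = value
    return params


first = add_param_buggy('startTime', '2009-Q2')
second = add_param_buggy('endTime', '2011-Q4')
print(first is second)  # both calls returned the same shared dict
print(second)           # so it now holds both keys

print(add_param_safe('startTime', '2009-Q2'))
print(add_param_safe('endTime', '2011-Q4'))  # no leftovers from before
```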

We can use this function like so, to grab economic data for the USA
and Australia from 2009-2010:

response = make_OECD_request('QNA',
    (('USA', 'AUS'), ('GDP', 'B1_GE'), ('CUR', 'VOBARSA'), ('Q',)),
    {'startTime': '2009-Q1', 'endTime': '2010-Q1'})

Requesting URL: http://stats.oecd.org/sdmx-json/data/QNA/
USA+AUS.GDP+B1_GE.CUR+VOBARSA.Q/all
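The URL printed above stops at /all because the query string is only
added when requests builds the final request from the parameter
dictionary. To preview the complete URL without touching the network,
something like the following sketch should do. It targets Python 3 and
uses only the standard library; make_OECD_url is a made-up variant of
the function in Example 5-1, not part of requests or the OECD API.

```python
from urllib.parse import urlencode

OECD_ROOT_URL = 'http://stats.oecd.org/sdmx-json/data'


def make_OECD_url(dsname, dimensions, params=None, root_dir=OECD_ROOT_URL):
    """Build the full request URL, query string included, without sending it."""
    # '+' joins the values of one dimension, '.' joins the dimensions
    dim_str = '.'.join('+'.join(d) for d in dimensions)
    url = root_dir + '/' + dsname + '/' + dim_str + '/all'
    if params:
        url += '?' + urlencode(params)
    return url


url = make_OECD_url(
    'QNA',
    (('USA', 'AUS'), ('GDP', 'B1_GE'), ('CUR', 'VOBARSA'), ('Q',)),
    {'startTime': '2009-Q1', 'endTime': '2010-Q1'})
print(url)
```

Note the trailing comma in ('Q',). Without it Python sees a plain
string rather than a one-item tuple, which happens to work for a single
letter but would split a longer code into +-separated characters. After
a real request, the URL that was actually fetched is also available as
the response’s url attribute.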

Now to look at the data, we just check the response is OK and have a
look at the dictionary keys:

if response.status_code == 200:
    json_data = response.json()
    json_data.keys()

Out: [u'header', u'dataSets', u'structure']

The resulting JSON data is in the SDMX format, designed to facilitate
the communication of statistical data. It’s not the most intuitive
format around but it’s often the case that data-sets have a less than
ideal structure. The good news is that Python is a great language for
knocking data into shape. For Python’s Pandas library (see ???) there
is pandaSDMX which currently handles the XML based format.

The OECD API is essentially RESTful [2], with all the query being
contained in the URL and the HTTP verb GET specifying a fetch
operation. If a specialised Python library isn’t available to use the
API, e.g. Tweepy for Twitter, then you’ll probably end up writing
something like Example 5-1. requests is a very friendly, well designed
library and can cope with pretty much all the manipulations required
to use a web-API.

Getting some country data for the Nobel-viz 

There are some national statistics which will come in handy for the
Nobel-visualisation we’re using our toolchain to build. Population
sizes, three-letter international codes (e.g. GBR, USA), geographic
centers etc., are potentially useful when visualising an international
prize and its distribution. REST-countries is a handy RESTful
web-resource with various international stats. Let’s use it to grab
some data.

Requests to REST-countries take the following form:

https://restcountries.eu/rest/v1/<field>/<name>?<params>

As with the OECD API (see Example 5-1), we can make a simple
calling function to allow easy access to the API’s data, like so:

REST_EU_ROOT_URL = 'https://restcountries.eu/rest/v1'

def REST_country_request(field='all', name=None, params=None):
    """ Query the REST-countries API and return a response """
    headers = {'User-Agent': 'Mozilla/5.0'}

    if not params:
        params = {}

    if field == 'all':
        return requests.get(REST_EU_ROOT_URL + '/all', headers=headers)

    url = '%s/%s/%s' % (REST_EU_ROOT_URL, field, name)
    print('Requesting URL: ' + url)
    return requests.get(url, params=params, headers=headers)

2 SDMX is a RESTful specification.

With the REST_country_request function to hand, let’s get a list of all
the countries using the US-dollar as currency:

response = REST_country_request('currency', 'usd')
if response.status_code == 200:  # request OK
    response.json()

Out:
[{u'alpha2Code': u'AS',
  u'alpha3Code': u'ASM',
  u'altSpellings': [u'AS',
  ...
  u'capital': u'Pago Pago',
  u'currencies': [u'USD'],
  u'demonym': u'American Samoan',
  ...
  u'latlng': [12.15, -68.266667],
  u'name': u'Bonaire',
  ...
  u'name': u'British Indian Ocean Territory',
  u'name': u'United States Minor Outlying Islands',
  ...

The full data-set at REST-countries is pretty small so for convenience
we’ll make a copy and store it locally to MongoDB and our
nobel-prize database using the get_mongo_database method from
“MongoDB” on page 97:

db_nobel = get_mongo_database('nobel_prize')
col = db_nobel['country_data']  # country-data collection

# Get all the RESTful country-data
response = REST_country_request()
if response.status_code == 200:
    # Insert the JSON-objects straight to our collection
    col.insert(response.json())

Out:
[ObjectId('5665a1ef26a7110b79e88d49'),
 ObjectId('5665a1ef26a7110b79e88d4a'),
 ...

With our country-data inserted into its MongoDB collection, let’s
again find all the countries using the US-dollar as currency:

res = col.find({'currencies': {'$in': ['USD']}})
list(res)

Out:
[{u'_id': ObjectId('5665a1ef26a7110b79e88d4d'),
  u'alpha2Code': u'AS',
  u'alpha3Code': u'ASM',
  u'altSpellings': [u'AS',
  ...
  u'currencies': [u'USD'],
  u'demonym': u'American Samoan',
  u'languages': [u'en', u'sm'],
  ...
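For a list-valued field such as currencies, the $in query above matches
any record whose list shares at least one value with the list supplied.
The same test is sketched below in plain Python on a few made-up,
trimmed-down records. The find_in helper is hypothetical and only
mirrors that matching rule; it does not talk to MongoDB.

```python
# Made-up, trimmed-down country records in the shape shown above
countries = [
    {'name': 'American Samoa', 'currencies': ['USD']},
    {'name': 'Bonaire', 'currencies': ['USD']},
    {'name': 'France', 'currencies': ['EUR']},
]


def find_in(records, field, wanted):
    """Keep records whose list-valued field shares a value with wanted."""
    wanted = set(wanted)
    return [r for r in records if wanted & set(r.get(field, []))]


usd_countries = find_in(countries, 'currencies', ['USD'])
print([c['name'] for c in usd_countries])
```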

Now that we’ve rolled a couple of our own API consumers, let’s take
a look at some dedicated libraries that wrap some of the larger web
APIs in an easy to use form.

Using Libraries to access Web-APIs 

requests is capable of negotiating with pretty much all web-APIs
and often a little function like Example 5-1 is all you need. But as the
APIs start adding authentication and the data structures become
more complicated, a good wrapper-library can save a lot of hassle
and reduce the tedious book-keeping. In this section I’ll cover a
couple of the more popular wrapper libraries to give you a feel for the
workflow and some useful start-points.

Using Google-spreadsheets 

It’s becoming more common these days to have live data-sets in the
cloud. So, for example, you might find yourself required to visualise
aspects of a Google-spreadsheet which is the shared data-pool for a
group. My preference is to get this data out of the Google-plex and
into Pandas, to start exploring it (see ???) but a good library will let
you access and adapt the data in-place, negotiating the web-traffic as
required.

Gspread is the best known Python library for accessing
Google-spreadsheets and makes doing so a relative breeze.

You’ll need OAuth 2.0 credentials to use the API [2]. The most up to
date guide can be found here. Following those instructions should
provide a JSON file containing your private key.

You’ll need to install gspread and the latest Python OAuth2 client
library. Here’s how to do it with pip:

$ pip install gspread
$ pip install --upgrade oauth2client

2 OAuth1 access has been deprecated recently.

Depending on your system you may also need PyOpenSSL:

$ pip install PyOpenSSL

See here for more details and trouble-shooting.

Google’s API assumes that the spreadsheets you
are trying to access are owned or shared by your
API-account, not your personal one. The email-address
to share the spreadsheet with is available
at your Google developers console and in the
JSON credentials key needed to use the API. It
should look something like
account-1@My Project...iam.gserviceaccount.com.

With those libraries installed you should be able to access any of
your spreadsheets in a few lines. I’m using the Microbe-scope
spreadsheet which you can see here. Example 5-2 shows how to load
the spreadsheet.

Example 5-2. Opening a Google-spreadsheet

import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials

json_key = json.load(open('My Project-b8a....json'))  # ❶
scope = ['https://spreadsheets.google.com/feeds']

credentials = SignedJwtAssertionCredentials(
    json_key['client_email'],
    json_key['private_key'].encode(), scope)

gc = gspread.authorize(credentials)

ss = gc.open('Microbe-scope')  # ❷

❶ The JSON credentials file is the one provided by
Google-Services.

❷ Here we’re opening the spreadsheet by name. Alternatives are
open_by_url or open_by_key. See here for details.

Now that we’ve got our spreadsheet we can see what work-sheets it
contains:

ss.worksheets()

Out: [<Worksheet 'bugs' id:od6>,
 <Worksheet 'outrageous facts' id:o74cw7y>,
 <Worksheet 'physicians per 1,000' id:okzh6fp>,
 <Worksheet 'amends' id:ogkk64p>]

ws = ss.worksheet('bugs')

With the worksheet bugs selected from the spreadsheet, gspread
allows you to access and change column, row and cell values
(assuming the sheet isn’t read-only). So we can get the values in the
first column (gspread numbers rows and columns from 1) with the
col_values command:

ws.col_values(1)

Out: [None,
 'grey = not plotted',
 'Anthrax (untreated)',
 'Bird Flu (H5N1)',
 'Bubonic Plague (untreated)',
 'C.Difficile',
 'Campylobacter',
 'Chicken Pox',
 'Cholera',...

Although you can use gspread’s API to plot directly, using a
plot-library like Matplotlib, I prefer to send the whole sheet to Pandas,
Python’s powerhouse programmatic spreadsheet. This is easily
achieved using gspread’s get_all_records, which returns a list of
item dictionaries. This list can be used directly to initialise a Pandas
DataFrame (see ???):

import pandas as pd

df = pd.DataFrame(ws.get_all_records())
df.info()

Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41 entries, 0 to 40
Data columns (total 23 columns):
average basic reproductive rate    41 non-null object
case fatality rate                 41 non-null object
infectious dose                    41 non-null object
...
upper R0                           41 non-null object
viral load in acute stage          41 non-null object
yearly fatalities                  41 non-null object
dtypes: object(23)
memory usage: 7.7+ KB

In ??? we’ll see how to interactively explore a DataFrame’s data.
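A list of item dictionaries is a handy shape because each row carries
its own column names. As a rough picture of that shape, the sketch
below pairs a header row with the rows beneath it. The rows and the
rows_to_records helper are made up for illustration and do not use
gspread or real Microbe-scope data.

```python
# Made-up rows standing in for a worksheet: first row holds the headers
rows = [
    ['name', 'case fatality rate', 'yearly fatalities'],
    ['bug A', '10%', '100'],
    ['bug B', '0.1%', '20'],
]


def rows_to_records(rows):
    """Pair each data row with the header row, one dictionary per row."""
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


records = rows_to_records(rows)
print(records[0])
```

A list like records can be handed to the DataFrame constructor, which
uses the dictionary keys as column names.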
Using the Twitter API with Tweepy

The advent of social media has generated a lot of data and an
interest in visualising the social-networks, trending hashtags,
media-storms etc. contained in it. Twitter’s broadcast network is probably
the richest source of cool data-visualisations and its API provides
tweets [2] filtered by user, hashtag, date etc.

Python’s Tweepy is an easy to use Twitter library which provides a
number of useful features, such as a StreamListener class for
streaming live twitter updates. To start using it you’ll need a Twitter
access token, which can be acquired by following the instructions
here to create your twitter application. Once this application is
created you can get the keys and access tokens for your app by clicking
on the link here.

Tweepy typically requires the four authorisation elements shown
here:

# The user credential variables to access Twitter API
access_token = "2677230157-Ze3bWuBAw4kwoj4via2dEntU86...TD7z"
access_token_secret = "DxwKAv\/zMFLq7WnQCnty49jg339Acu...paRSZH"
consumer_key = "pIorGFGQHShuYQtIxzYWkljMD"
consumer_secret = "yLc4Flw82G0Zn4vTt4q8pSBcNyFlkn35BfIe...o\/a4P7R"

With those defined, accessing tweets could hardly be easier. Here we
create an OAuth auth object using our tokens and keys and use it to
start an API session. We can then grab the latest tweets from our
timeline:

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

RT @Glinner: Read these tweets https://t.co/Qqz3PsDxUD
Volodymyr Bilyachat https://t.co/\/IyOHije6b +1 bmeyer
#javascript

RT @bbcworldservice: If scientists edit genes to
make people healthier does it change what it means to be
human? https://t.co/Vciuyu6BCx h...

RT @ForrestTheWoods:
Launching something pretty cool tomorrow. I'm excited. Keep

2 The free API is currently limited to around 350 requests per hour.

Tweepy’s API class offers a lot of convenience methods which you
can check out here. A common visualisation is using a network
graph to show patterns of friends and followers among Twitter
sub-populations. The Tweepy methods followers_ids (get all users
following) and friends_ids (get all users being followed) can be used
to construct such a network:

my_follower_ids = api.followers_ids()  # ❶

for id in my_follower_ids:
    followers = api.followers_ids(id)  # ❷
    # ...

❶ Gets a list of your followers’ ids, e.g. [1191701545,
1554134420, ...].

❷ The first argument to followers_ids can be an id or
screen-name.

By mapping followers of followers etc. you can create a network of
connections which might just reveal something interesting about
groups and subgroups clustered about a particular individual or
subject. There’s a nice example of just such a twitter analysis here.
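The followers-of-followers idea can be sketched as a breadth-first walk
that records who follows whom. To keep the example runnable offline,
the Twitter call is replaced by a hypothetical fetch_follower_ids
function backed by made-up data. In a real crawl that function would
wrap the Tweepy call, and the API’s rate limit would make the walk
slow.

```python
# Made-up follower lists keyed by user id, standing in for API responses
FAKE_FOLLOWERS = {
    1: [2, 3],
    2: [3],
    3: [],
}


def fetch_follower_ids(user_id):
    """Hypothetical stand-in for a call to the Twitter API."""
    return FAKE_FOLLOWERS.get(user_id, [])


def build_follower_network(start_id, depth=2):
    """Return (follower, followed) edges, exploring followers of followers."""
    edges, seen, frontier = set(), {start_id}, [start_id]
    for _ in range(depth):
        next_frontier = []
        for user_id in frontier:
            for follower_id in fetch_follower_ids(user_id):
                edges.add((follower_id, user_id))
                if follower_id not in seen:
                    # Only expand each user once
                    seen.add(follower_id)
                    next_frontier.append(follower_id)
        frontier = next_frontier
    return edges


print(sorted(build_follower_network(1)))
```

The resulting edge list is the kind of structure a network-graph
library can consume directly.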

One of the coolest features of Tweepy is its StreamListener class,
which makes it easy to collect and process filtered tweets in
real-time. Live updates of twitter streams have been used by many
memorable visualisations such as tweetping. Let’s set up a little stream
to record tweets mentioning Python, JavaScript and Dataviz and save
them to a MongoDB database, using the get_mongo_database method
from “MongoDB” on page 97:

# ...
from tweepy.streaming import StreamListener
import json
# ...

class MyStreamListener(StreamListener):
    """ Streams tweets and saves to a MongoDB database """

    def __init__(self, api, **kw):
        self.api = api
        super(tweepy.StreamListener, self).__init__()
        self.col = get_mongo_database('tweets', **kw)['tweets']  # ❶

    def on_data(self, tweet):
        self.col.insert(json.loads(tweet))  # ❷

    def on_error(self, status):
        return True  # keep stream open


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

stream = tweepy.Stream(auth, MyStreamListener(api))

# Start the stream with track-list of keywords
stream.filter(track=['python', 'javascript', 'dataviz'])

❶ The extra kw keywords allow us to pass the MongoDB specific
host, port, username/password arguments to the
stream-listener.

❷ The data is a raw JSON string that needs decoding before
inserting into our tweets collection.

Now that we’ve had a taste of the kind of APIs you might run into in
your search for interesting data, let’s look at the primary technique
you’ll use if, as is often the case, no one is providing the data you
want in a neat, user-friendly form: scraping data with Python.

Scraping Data

Scraping is the chief metaphor used for the practice of getting data
that wasn’t designed to be programmatically consumed off the web.
It is a pretty good metaphor because scraping is often about getting
the balance right between removing too much and too little.
Creating procedures that extract just the right data, as clean as possible,
from web-pages is a craft skill and often a fairly messy one at that.
But the pay-off is access to visualizable data that often cannot be got
in any other way. Approached in the right way scraping can even
have an intrinsic satisfaction.

Why we need to scrape

In an ideal virtual world, online data would be organised in a library,
with everything catalogued using a sophisticated Dewey Decimal
system for the web-age. Unfortunately for the keen data hunter, the
web has grown organically, often unconstrained by considerations
of easy data access for the budding data visualiser. So in reality the
web resembles a big mound of data, some of it clean and usable (and
thankfully this percentage is increasing) but much of it poorly
formed and designed for human consumption. And humans are
able to parse the kind of messy, poorly-formed data that our
relatively dumb computers have problems with [2].

Scraping is about fashioning selection patterns that grab the data we
want and leave the rest behind. If we’re lucky the web-pages
containing the data have helpful pointers, like named tables, specific
identities in preference to generic classes etc. If we’re unlucky then
these pointers are missing and we have to resort to using other patterns
or, in a worst case, ordinal specifiers such as third table in the main
div. These latter are obviously pretty fragile, broken in this case if
somebody adds a table above the third.

Essentially, if you haven’t been given the data in ‘clean’ form and don’t
have access to a web-API to deliver the data you need, in JSON, XML or
some other common format, then you will probably find that the
dataset you need to create your visualisation is encoded in HTML,
in the form of tables, headers, ordered and unordered lists of
content and the like. If the data-set is small enough, and we’re talking
very small, you could resort to cut and paste but, aside from the
tedium and inevitable human-error involved, this approach just isn’t
going to scale. But, generally, although the data is secreted within
blocks of HTML those blocks have some repeated structure and, if
we’re lucky, CSS labels. These two facts allow us to describe, in the
formal way a computer understands, where that data is in the
HTML. We can then extract it, manually and on a small scale using
specialised tools like requests and BeautifulSoup or in bulk, using
Scrapy (see ???).

2 Much of modern Machine Learning and Artificial Intelligence (AI) research is
dedicated to creating computer software that can cope with messy, noisy, fuzzy,
informal data but, as of this book’s publication, there’s no off-the-shelf solution
I know of.

In this section we’ll set a little scraping task, aiming to get some
Nobel-prize winners data. We’ll use Python’s best-of-breed
BeautifulSoup for this lightweight scraping foray, saving the heavy guns of
Scrapy for the next chapter.

BeautifulSoup and lxml

Python’s key lightweight scraping tools are BeautifulSoup and lxml.
Their primary selection syntax is different but, confusingly, both can
use each other’s parsers. The consensus seems to be that lxml is
considerably faster but BeautifulSoup might be more robust dealing
with poorly-formed HTML. Personally, I’ve found lxml to be robust
enough and its syntax, based on xpaths, more powerful and
intuitive. I think for someone coming from web-development, familiar
with CSS and jQuery, selection based on CSS is much more natural.
But, as mentioned, BeautifulSoup allows us access to these selectors
and has a bigger following, which often pays off in, for example,
StackOverflow advice. In the following sections I’ll use
BeautifulSoup’s selectors. In the next chapter we’ll see lxml’s xpath
selectors in action with Scrapy.

BeautifulSoup is part of the Anaconda packages (see Chapter 1) and
easily installed with pip:

$ pip install beautifulsoup4

A First Scraping Foray 

Armed with requests and BeautifulSoup, let’s set ourselves a little
task, to get the names, years, categories and nationalities of all the
Nobel prize-winners. We’ll start at the main Wikipedia Nobel page
at http://en.wikipedia.org/wiki/List_of_Nobel_laureates. Scrolling
down shows a table with all the Laureates by year and category,
which is a good start to our minimal data requirements.

Some kind of HTML-explorer is pretty much a must for
web-scraping and the best I know is Chrome web-developer’s elements
tab (see “The Elements Tab” on page 120). Figure 5-1 shows the
key-elements involved in quizzing a web-page’s structure. We need to
know how to select the data of interest, in this case a Wikipedia
table, while avoiding other elements on the page. Crafting good
selector patterns is the key to effective scraping and highlighting the
DOM element using the element inspector gives us both the CSS
pattern and, with a right-click on the mouse, the xpath. The latter is
a particularly powerful syntax for DOM-element selection and the
basis of our industrial strength scraping solution, Scrapy.

Figure 5-1. Wikipedia’s Main Nobel-prize Page: A. and B. show the
wikitable’s CSS-selector. Right-clicking the mouse and selecting C.
(‘Copy XPath’) gives the table’s xpath (//*[@id="mw-content-
text"]/table[1]). D. shows a <thead> tag generated by jQuery.

Selecting the data

When doing basic scraping on the page’s HTML, i.e. not parsing
JavaScript-generated pages, we are working with the source code.
Sometimes JavaScript is used to alter the page structure, an example
being the addition by jQuery of extra tags (see Figure 5-1 D.) to
make the table sortable. These are only visible on the rendered page
so cannot be used as selection guides working on the raw source.
For this reason it’s sensible to have the source HTML to hand and be
prepared for JS additives. To access the source you can right-click
and select from the menu or use the CTRL-U short-cut.

The first thing we need to do is select the data table. The
get_main_table() method shown in Example 5-3 does just that.

Example 5-3. Selecting a Wikipedia table with BeautifulSoup

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://en.wikipedia.org'
# Wikipedia will reject our request unless we add a 'User-Agent'
# attribute to our http header.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def get_main_table():
    """ Get the wikitable list of Nobel winners """
    # Make a request to the Nobel-page, setting valid headers
    response = requests.get(
        BASE_URL + '/wiki/List_of_Nobel_laureates',
        headers=HEADERS)
    # Parse the response content with BeautifulSoup
    soup = BeautifulSoup(response.content)
    # Use the parsed tree to find our table
    table = soup.find('table', {'class': 'wikitable sortable'})  # ❶
    return table

❶ The second argument to find takes a dictionary of element
properties: there’s only one table with the classes wikitable and
sortable.

First we send the page content to BeautifulSoup, which parses it,
creating a tree-structure which we apply selectors to. If you look at
Figure 5-1 you’ll see that our table has the CSS classes wikitable and
sortable. There’s only one such table on the page so these two classes
disambiguate the data table we need. Using the find method on our
parsed content we pass in a dictionary to filter any child elements by
class (in this case), id, name etc.

Lefs see what we get, using the selections prettify method to return a 
readable string. 

wikitable = get_main_table()
print(wikitable.prettify())

<table class="wikitable sortable">
 <tr>
  <th>
   Year
  </th>
  <th width="18%">
   <a href="/wiki/List_of_Nobel_laureates_in_Physics" title="List of No...
    Physics
   </a>
  </th>
  <th width="16%">
   <a href="/wiki/List_of_Nobel_laureates_in_Chemistry" title="List of ...
    Chemistry
 ...
 <tr>
  <td align="center">
   1901
  </td>
  <td>
   <span class="sortkey">
    Röntgen, Wilhelm
   </span>
 ...
</table>

This output shows the parser has successfully constructed an HTML
tree from our wiki-table. Now let's get some information from it.

Crafting some selection patterns 

Having successfully selected our data table, we now want to craft
some selection patterns to scrape the required data. Using the
HTML-explorer you can see that the individual winners are contained
in <td> cells, with an <a> link to Wikipedia's bio-pages (in the case
of individuals). Here's a typical target row, with CSS classes we can
use as targets to get the data in the <td> cells:

<tr>
 <td align="center">
  1901
 </td>
 <td>
  <span class="sortkey">
   Röntgen, Wilhelm
  </span>
  <span class="vcard">
   <span class="fn">
    <a href="/wiki/Wilhelm_R%C3%B6ntgen" title="Wilhelm Röntgen">
     Wilhelm Röntgen
    </a>
   </span>
  </span>
 </td>
 <td>
 ...
</tr>

If we loop through these data cells, keeping track of their row (year)
and column (category), then we should be able to create a list of
winners with all the data we specified except nationality.

The following get_column_titles function scrapes our table for the
Nobel category column headers, ignoring the first Year column.
Often the header cell of a Wikipedia table contains a web-linked <a>
tag and all the Nobel categories fit this model, pointing to their
respective Wikipedia pages. If the header is not clickable we store its
text and a null href:

def get_column_titles(table):
    """ Get the Nobel categories from the table header """
    cols = []
    for th in table.find('tr').find_all('th')[1:]:  # ❶
        link = th.find('a')
        # Store the category name and any Wikipedia link it has
        if link:
            cols.append({'name': link.text,
                         'href': link.attrs['href']})
        else:
            cols.append({'name': th.text, 'href': None})
    return cols

❶ We loop through the table head, ignoring the first Year column
([1:]). This selects the column titles shown in Figure 5-2.

Let's make sure get_column_titles is giving us what we want:

get_column_titles(wikitable)
Out:
[{'href': '/wiki/List_of_Nobel_laureates_in_Physics',
  'name': u'Physics'},
 {'href': '/wiki/List_of_Nobel_laureates_in_Chemistry',
  'name': u'Chemistry'}...

Figure 5-2. Wikipedia's table of Nobel-winners


We use the column names from get_column_titles in the following
get_nobel_winners_BS function:

def get_nobel_winners_BS(table):
    cols = get_column_titles(table)
    winners = []
    for row in table.find_all('tr')[1:-1]:  # ❶
        year = int(row.find('td').text)  # Gets 1st <td>
        for i, td in enumerate(row.find_all('td')[1:]):  # ❷
            for winner in td.find_all('a'):
                href = winner.attrs['href']
                if not href.startswith('#endnote'):
                    winners.append({
                        'year': year,
                        'category': cols[i]['name'],
                        'name': winner.text,
                        'link': winner.attrs['href']
                    })
    return winners

❶ Gets all the Year-rows, starting from the second, corresponding
to the rows in Figure 5-2.

❷ Finds the <td> data-cells shown in Figure 5-2.

Iterating through the year rows, we take the first Year column and
then iterate over the remaining columns, using enumerate to keep
track of our index, which will map to the category column names.
We know that all the winner names are contained in an <a> tag but
that there are occasional extra <a> tags whose href begins with
#endnote, which we filter out. Finally we append a year, category,
name and link dict to our data-array. Note that the winner selector
has an attrs dict containing, among other things, the <a> tag's href.

Let's use our wikitable to confirm that get_nobel_winners_BS delivers
a list of winner dictionaries with the correct attributes. We'll use
Python's built-in pprint module to pretty-print the results:

from pprint import pprint

wikitable = get_main_table()
winners = get_nobel_winners_BS(wikitable)
pprint(winners[:10])

[{'category': u'Physics',
  'link': '/wiki/Wilhelm_R%C3%B6ntgen',
  'name': u'Wilhelm R\xf6ntgen',
  'year': 1901},
 {'category': u'Chemistry',
  'link': '/wiki/Jacobus_Henricus_van_%27t_Hoff',
  'name': u"Jacobus Henricus van 't Hoff",
  'year': 1901},
 {'category': u'Physiology\nor Medicine',
  'link': '/wiki/Emil_Adolf_von_Behring',
  'name': u'Emil Adolf von Behring',
  'year': 1901},
 ...

Now that we have the full list of Nobel prize-winners and links to
their Wikipedia pages, we will be using these links to scrape data
from the individuals' biographies. This will involve making a largish
number of requests, and it's not something we really want to do
more than once. The sensible thing is to cache the data we scrape,
allowing us to try out various scraping experiments without
returning to Wikipedia.

Caching the web-pages 

It's easy enough to rustle up a quick cacher in Python but, as often
as not, easier still to find a better solution written by someone else
and kindly donated to the open-source community. Requests has a nice
plugin called requests-cache which, with a few lines of
configuration, will take care of all your basic caching needs.

First we install the plugin using pip:

$ pip install --upgrade requests-cache

requests-cache uses monkey-patching to dynamically replace parts
of the requests API at run-time. This means it can work
transparently. You just have to install its cache and then use
requests as usual, with all the caching being taken care of. Here's
the simplest way to use requests-cache:

import requests
import requests_cache

requests_cache.install_cache()
# use requests as usual...
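
To make the monkey-patching idea concrete, here is a minimal, standard-library-only sketch. It does not use requests or requests-cache: the fake_http namespace, its get function and this install_cache helper are all hypothetical stand-ins, invented purely to show how replacing a function at run-time can add caching without the calling code changing.

```python
import types

# Hypothetical stand-in for an HTTP library (NOT the real requests API)
fake_http = types.SimpleNamespace()
calls = []  # records every simulated "network" hit


def _real_get(url):
    """Pretend to fetch a page, logging the call."""
    calls.append(url)
    return 'content of ' + url


fake_http.get = _real_get


def install_cache(module):
    """Monkey-patch module.get with a wrapper that caches by URL."""
    original_get = module.get
    cache = {}

    def cached_get(url):
        if url not in cache:
            cache[url] = original_get(url)  # only hit the "network" once
        return cache[url]

    module.get = cached_get  # the run-time replacement


install_cache(fake_http)
first = fake_http.get('/wiki/Marie_Curie')
second = fake_http.get('/wiki/Marie_Curie')  # served from the cache
print(first == second, len(calls))
```

The calling code still just uses fake_http.get; only the installation step knows a cache is involved, which is the transparency described above.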

The install_cache method has a number of useful options, e.g.
allowing you to specify the cache backend (sqlite, memory, mongodb
or redis) or set an expiry time (expire_after) in seconds on the
caching. So the following creates a cache named nobel_pages with an
sqlite backend and pages that expire in two hours (7200s):

requests_cache.install_cache('nobel_pages',
                             backend='sqlite', expire_after=7200)

If you get tired of calculating the number of seconds for your
expiry time, you can pass a timedelta to install_cache instead:

from datetime import timedelta
expire_after = timedelta(days=1)

requests_cache.install_cache(expire_after=expire_after)
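
As a quick sanity check on the arithmetic, the following sketch (standard library only, variable names chosen for illustration) shows how a timedelta maps onto the seconds values used above:

```python
from datetime import timedelta

two_hours = timedelta(hours=2)
one_day = timedelta(days=1)

# total_seconds() returns a float number of seconds
print(two_hours.total_seconds())
print(one_day.total_seconds())
```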

requests-cache will serve most of your caching needs and couldn't
be much easier to use. For more details see the requests-cache
documentation, where you'll also find a little example of
request-throttling, a useful technique when doing bulk scraping.
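
Request-throttling simply means enforcing a minimum gap between successive requests. The sketch below is one possible way to do it with the standard library; the Throttle class is hypothetical (it is not part of requests or requests-cache), and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Hypothetical helper: keep calls at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause for the rest of the interval
                now = self._clock()
        self._last = now


# Demonstration with a fake clock so nothing actually sleeps
fake_time = [0.0]
slept = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(2.0, clock=fake_clock, sleep=fake_sleep)
throttle.wait()          # first call never waits
fake_time[0] += 0.5      # half a second of "work"
throttle.wait()          # must wait out the remaining 1.5 seconds
fake_time[0] += 3.0      # longer than the interval
throttle.wait()          # no wait needed
print(slept)
```

In a real scraper you would call throttle.wait() immediately before each request, using the default clock and sleep.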

Scraping the winners' nationalities 

With caching in place, let's try getting the winners' nationalities,
using the first fifty for our experiment. A little get_nationality()
function will use the winner links we stored earlier to scrape their
page and then use the infobox shown in Figure 5-3 to get the
Nationality attribute.

Figure 5-3. Scraping a winner's nationality



When scraping you are looking for reliable patterns, repeating
elements with useful data. As we'll see, the Wikipedia infoboxes for
individuals are not such a reliable source, although clicking on a
few random links certainly gives the impression that they are.
Depending on the size of the dataset, it's good to perform a few
experimental sanity-checks. You can do this manually but, as
mentioned at the start of the chapter, this won't scale or improve
your craft skills.


Example 5-4 takes one of the winner dictionaries we scraped earlier
and returns a name-labelled dictionary with a Nationality key if one
is found. Let's run it on the first fifty winners and see how often a
Nationality attribute is missing:


Example 5-4. Scraping the winner's country from their biography page

def get_nationality(w):
    """ scrape biographic data from the winner's wikipedia page """
    data = get_url('http://en.wikipedia.org' + w['link'])
    soup = BeautifulSoup(data)
    person_data = {'name': w['name']}
    attr_rows = soup.select('table.infobox tr')  # ❶
    for tr in attr_rows:  # ❷
        try:
            attribute = tr.find('th').text
            if attribute == 'Nationality':
                person_data[attribute] = tr.find('td').text
        except AttributeError:
            pass
    return person_data

❶ Selects all rows of the infobox table.
❷ Cycles through the rows looking for a Nationality field.



Figure 5-4. Winners without a recorded nationality


Example 5-5 shows that 14 of the first 50 winners failed our attempt
to scrape their nationality. In the case of the Institut de Droit
International, national affiliation may well be moot, but Theodore
Roosevelt is about as American as they come. Clicking on a few of the
names shows the problem (see Figure 5-4). The lack of a standardised
biography format means synonyms for Nationality are often employed,
as in Marie Curie's Citizenship; sometimes no reference is made, as
with Niels Finsen; and Randal Cremer has nothing but a photograph in
his infobox. We can discard the infoboxes as a reliable source of
winners' nationalities but, as they appeared to be the only regular
source of potted data, this sends us back to the drawing board. In
the next chapter we'll see a successful approach using Scrapy and a
different start page.

Example 5-5. Testing for scraped nationalities

wdata = []
# test first 50 winners
for w in winners[:50]:
    wdata.append(get_nationality(w))
missing_nationality = []
for w in wdata:
    # if missing 'Nationality' add to list
    if not w.get('Nationality'):
        missing_nationality.append(w)
# output list
missing_nationality

[{'name': u'\xc9lie Ducommun'},
 {'name': u'Charles Albert Gobat'},
 {'name': u'Marie Curie'},
 {'name': u'Niels Ryberg Finsen'},
 {'name': u'Randal Cremer'},
 {'name': u'Institut de Droit International'},
 {'name': u'Bertha von Suttner'},
 {'name': u'Theodore Roosevelt'},
 ...


Although Wikipedia is a relative free-for-all, production-wise,
wherever data is designed for human consumption you can expect a lack
of rigour. Many sites have similar gotchas and, as the data sets get
bigger, more tests may be needed to find the flaws in a collection
pattern.
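
One cheap programmatic test of this kind is to tally which labels actually occur across the scraped pages, which tends to expose synonyms and gaps quickly. The sketch below uses only the standard library; the infobox_labels data is invented for illustration (it is not real scraped output) and stands in for the <th> labels collected from each biography page.

```python
from collections import Counter

# Hypothetical label lists, one per scraped biography page
infobox_labels = [
    ['Born', 'Died', 'Nationality', 'Fields'],
    ['Born', 'Died', 'Citizenship', 'Fields'],  # synonym for Nationality
    ['Born', 'Died'],                           # no reference at all
    ['Born', 'Nationality'],
]

# How often does each label appear across all pages?
label_counts = Counter(
    label for labels in infobox_labels for label in labels)

# Which pages lack the label we were relying on?
missing = [i for i, labels in enumerate(infobox_labels)
           if 'Nationality' not in labels]
failure_rate = len(missing) / len(infobox_labels)

print(label_counts.most_common())
print(missing, failure_rate)
```

A label such as Citizenship showing up in the tally is a hint that the collection pattern needs widening, while a high failure rate suggests looking for a different source altogether.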

Although our first scraping exercise was a little artificial, in order
to introduce the tools, I hope it captured something of the slightly
messy spirit of web-scraping. The ultimately abortive pursuit of a
reliable nationality field for our Nobel data-set could have been
forestalled by a bit of web-browsing and manual HTML-source trawling,
but if the dataset were significantly larger and the failure rate a
bit smaller then programmatic detection, which gets easier and easier
as you become acquainted with the scraping modules, really starts to
deliver.

This little scraping test was designed to introduce BeautifulSoup and
shows that collecting the data we set ourselves requires a little more
thought, as is often the case with scraping. In the next chapter we'll
wheel out the big gun, Scrapy, and, with what we've learned in this
section, harvest the data we need for our Nobel-visualisation.

Summary

In this chapter we've seen examples of the most common ways in
which data can be sucked out of the web and into Python containers,
databases or Pandas datasets. Python's requests library is the
true work-horse of HTTP negotiation and a fundamental tool in our
dataviz toolchain. For simpler, RESTful APIs, consuming data with
requests is a few lines of Python away. For the more awkward APIs,
for example those with potentially complicated authorisation, a
wrapper library like Tweepy (for Twitter) can save a lot of hassle.
Decent wrappers can also keep track of access rates and, where
necessary, throttle your requests. This is a key consideration,
particularly where there is the possibility of black-listing
unfriendly consumers.

We also started our first forays into data-scraping, often a necessary
fall-back where no API exists and the data is for human consumption.
In the next chapter we'll aim to get all the Nobel-prize data
needed for the book's visualisation, using Python's Scrapy, an
industrial-strength scraping library.

CHAPTER 6


Heavyweight Scraping with Scrapy 


Where BeautifulSoup is a very handy little pen-knife for fast and
dirty scraping, Python has a library, Scrapy, which can do large-scale
data scrapes with ease. It has all the things you'd expect, like
built-in caching (with expiration times), asynchronous requests via
Python's Twisted framework, User-Agent randomisation and a
whole lot more. Although it's got a steeper learning curve than
BeautifulSoup, it's still Python-succinct and quickly becomes
routine. For any large data scraping jobs, this is a must-have tool.

In “Scraping Data” on page 160, we managed to scrape a dataset
containing all the Nobel-winners by name, year and category. We
did a speculative scrape of the winners' linked biography pages
which showed that extracting the country of nationality was going
to be difficult. In this chapter we'll set the bar on our Nobel-prize
data a bit higher and aim to scrape objects of the form shown in
Example 6-1.


Example 6-1. Our targeted Nobel JSON object

{
  'category': u'Physiology or Medicine',
  'date_of_birth': u'8 October 1927',
  'date_of_death': u'24 March 2002',
  'gender': 'male',
  'link': u'http://en.wikipedia.org/wiki/C%C3%A9sar_Milstein',
  'name': u'C\xe9sar Milstein',
  'country': u'Argentina',
  'place_of_birth': u'Bah\xeda Blanca, Argentina',
  'place_of_death': u'Cambridge, England',
  'year': 1984}

In addition to this data, we'll aim to scrape prize winners' photos
(where applicable) and some potted biographical data (see
Figure 6-1). We'll be using the photos and body-text to add a little
character to our Nobel-visualisation.


Figure 6-1. Scraping targets for the prize-winners' pages


Setting up Scrapy 

Scrapy is one of the Anaconda packages (see Chapter 1) so you
should already have it to hand. If you're not using Anaconda, a
quick pip install will do the job²:

$ pip install scrapy

² See the Scrapy installation documentation for any platform-specific
details.

With Scrapy installed, you should have access to the scrapy
command. Unlike the vast majority of Python libraries, Scrapy is
designed to be driven from the command-line, within the context of
a scraping project, defined by configuration files, scraping-spiders,
pipelines etc. Let's generate a fresh project for our Nobel-prize
scraping, using the startproject option. This is going to generate a
project folder so make sure you run it from a suitable work
directory:

$ scrapy startproject nobel_winners
New Scrapy project 'nobel_winners' created in:
    /home/kyran/workspace/.../scrapy/nobel_winners

You can start your first spider with:
    cd nobel_winners
    scrapy genspider example example.com

As the output of startproject says, you'll want to switch to the
nobel_winners directory in order to start driving scrapy.

Let's take a look at the project's directory tree:

nobel_winners
├── nobel_winners
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

As shown, the project directory has a sub-directory with the same
name and a config file scrapy.cfg. The nobel_winners sub-directory
is a Python module (containing an __init__.py file) with a few
skeleton files and a spiders directory, which will contain your
scrapers.

Establishing the Targets

In “Scraping Data” on page 160 we tried to scrape the Nobel
winners' nationalities from their biography pages but found they were
missing or inconsistently labelled in many cases. Rather than get the
country data indirectly, a little Wikipedia searching shows a way
through. There is a page that lists winners by country. The winners
are presented in titled, ordered lists (see Figure 6-2), not in
tabular form, which makes recovering our basic name, category and
year data a little harder. Also the data organisation is not ideal,
e.g. the header titles and lists aren't in useful, separate blocks.
But it still nets us the all-important country field and a few
well-structured xpath queries should easily target the data we need.

Figure 6-2 shows the starting page for our first spider along with the
key elements it will be targeting. A list of country name titles (A) is
followed by an ordered list (B) of their Nobel-prize winning citizens.


In order to scrape the list-data we need to fire up our Chrome
browser's development tools (see “The Elements Tab” on page 120)
and inspect the target elements using the Elements tab and its
inspector (magnifying glass). Figure 6-3 shows the key HTML targets
for our first spider: header titles (h2) containing a country
name and followed by an ordered list (ol) of winners (li).


[Screenshot of en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country:
a country title (A), e.g. Argentina, followed by an ordered list (B) of
winners, e.g. "César Milstein, Physiology or Medicine, 1984".]

Figure 6-2. Scraping Wikipedia's Nobel-prizes by nationality


[Screenshot of Chrome's Elements tab for the Argentina section, showing
the h2, ol and li targets:]

<h2>
  <span class="mw-headline" id="Argentina">Argentina</span>    (country name)
  <span class="mw-editsection">...</span>
</h2>
<ol>
  <li>
    <a href="/wiki/C%C3%A9sar_Milstein" title="César Milstein">César Milstein</a>
    ", Physiology or Medicine, 1984"
  </li>
  <li>...</li>
  ...
</ol>
<h2>...</h2>

Figure 6-3. Finding the HTML-targets for the Wiki-list


Targeting HTML with Xpaths 

Scrapy uses xpaths to define its HTML targets. Xpath is a syntax
for describing parts of an X(HT)ML document and while it can get
rather complicated, the basics are straightforward and will often
solve the job at hand.

You can get the xpath of an HTML element by using Chrome's
Elements tab to hover over the source and then right-clicking the
mouse and selecting Copy Xpath. For example, in the case of our
Nobel Wiki-list's country names (h2 in Figure 6-3), selecting the
xpath of 'Argentina' (the first country) gives the following:

//*[@id="mw-content-text"]/h2[1]

We can use the following xpath rules to decode it:

//E               element <E> by relative reference (in this case
                  relative to the root document)
//E[@id="foo"]    select element <E> with id foo
//*[@id="foo"]    select any element with id foo
//E/F[1]          first child element <F> of element <E>
//E/*[1]          first child of element <E>

Following these rules shows our Argentinian title
//*[@id="mw-content-text"]/h2[1] is the first header (h2) child of a
DOM element with id mw-content-text. This is equivalent to the
following HTML:

<div id="mw-content-text">
  <h2>
  ...
  </h2>
  ...
</div>

Note that unlike Python, xpaths don't use a zero-based index but
make the first member '1'.
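
These rules can be tried out without Scrapy. The sketch below is only an approximation: it uses the standard library's xml.etree.ElementTree, which supports a limited subset of xpath and expects a leading '.' on descendant searches, and the PAGE string is a made-up, well-formed stand-in for the real Wikipedia markup.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page structure (not the real Wikipedia HTML)
PAGE = """
<html><body>
  <div id="mw-content-text">
    <h2>Argentina</h2>
    <ol><li>Cesar Milstein, Physiology or Medicine, 1984</li></ol>
    <h2>Australia</h2>
    <ol><li>Brian P. Schmidt, Physics, 2011</li></ol>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Any element with id mw-content-text, then its FIRST h2 child (index 1)
first_h2 = root.find(".//*[@id='mw-content-text']/h2[1]")
second_h2 = root.find(".//*[@id='mw-content-text']/h2[2]")

# All h2 elements anywhere below the root
all_h2 = root.findall(".//h2")

print(first_h2.text, second_h2.text, len(all_h2))
```

Note that h2[1] picks out the Argentina header, confirming the one-based indexing.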

Testing xpaths with the Scrapy shell 

Getting your xpath targeting right is crucial to good scraping and 
can involve a degree of iteration. Scrapy makes this process much 
easier by providing a command-line shell, which takes a URL and 
creates a response context in which you can try out your xpaths, like 
so: 


$ scrapy shell https://en.wikip...List_of_Nobel_laureates_by_country

2015-12-15 17:42:12+0000 [scrapy] INFO: Scrapy 0.24.4 started
(bot: nobel_winners)
2015-12-15 17:42:12+0000 [default] INFO: Spider opened
2015-12-15 17:42:13+0000 [default] DEBUG: Crawled (200)
<GET https://en.wikip...List_of_Nobel_laureates_by_country>
(referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3a8f510>
[s]   item       {}
[s]   request    <GET https://en.wik...Nobel_laureates_by_country>
[s]   response   <200 https://en.wik...Nobel_laureates_by_country>
[s]   settings   <scrapy.settings.Settings object at 0x34a98d0>
[s]   spider     <Spider 'default' at 0x3f59190>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local
                        objects
[s]   view(response)    View response in a browser
In [1]:

Now we have an IPython-based shell with code-complete and syntax
highlighting, in which to try out our xpath targeting. Let's grab all
the <h2> headers on the Wiki-page:

In [1]: h2s = response.xpath('//h2')

The resulting h2s is a SelectorList, a specialised Python list object.
Let's see how many headers we have:

len(h2s)
Out:
76

We can grab the first Selector object and query its methods using tab
auto-complete:

h2 = h2s[0]
h2.
h2.css       h2.namespaces  h2.remove_namespaces
h2.text      h2.extract     h2.re
h2.response  h2.type        h2.register_namespace  h2.select

You'll often be using the extract method to get the raw result of the
xpath selector:

h2.extract()
Out:
u'<h2>Contents</h2>'

This shows our first <h2> header is that of the table of contents for
our list of winners by nationality. Let's look at the second header:

h2s[1].extract()
Out:
u'<h2>
<span class="mw-headline" id="Argentina">Argentina</span>
<span class="mw-editsection"><span class="mw-editsection-bracket">
...
</h2>'

This shows that our country headers start on the second <h2> and
contain a span with class mw-headline. We can use the presence of
the mw-headline class as a filter for our country headers and the
contents as our country label. Let's try out an xpath, using the
xpath text() function to extract the text from the mw-headline span.
Note that we use the xpath method of the <h2> selector, which
makes the xpath query relative to that element.

h2_arg = h2s[1]
country = h2_arg.xpath('span[@class="mw-headline"]/text()').extract()
country
Out:
[u'Argentina']

The extract method returns a list of possible matches, in our case
the single Argentina string. By iterating through the h2s list, we can
now get our country names.

Assuming we have a country's <h2> header, we now need to get the
<ol> ordered list of Nobel winners following it (Figure 6-2 B).
Handily the xpath following-sibling selector can do just that. Let's
grab the first ordered list after the Argentina header:

ol_arg = h2_arg.xpath('following-sibling::ol[1]')
ol_arg
Out:
[<Selector xpath='following-sibling::ol[1]' data=u'<ol>\n<li>
<a href="/wiki/C%C3%A9sar_Milst'>]

Looking at the truncated data for ol_arg shows we have selected an
ordered-list. Note that even though there's only one Selector, xpath
still returns a SelectorList. For convenience you'll generally just
select the first member directly:

ol_arg = h2_arg.xpath('following-sibling::ol[1]')[0]

Now that we've got the ordered list, let's get a list of its member
<li> elements:

lis_arg = ol_arg.xpath('li')
len(lis_arg)
Out: 5

Let's examine one of those list elements using extract. As a first
test, we're looking to scrape the name of the winner and capture the
list element's text.

li = lis_arg[0]  # select the first list element
li.extract()
Out:
u'<li><a href="/wiki/C%C3%A9sar_Milstein"
title="C\xe9sar Milstein">C\xe9sar Milstein</a>,
Physiology or Medicine, 1984</li>'

Extracting the list element shows a standard pattern: a hyperlinked
name to the winner's Wikipedia page followed by a comma-separated
winning category and year. A robust way to get the winning name is
just to select the text of the list element's first <a> tag:

name = li.xpath('a//text()')[0].extract()
name
Out: u'C\xe9sar Milstein'

It's often useful to get all the text in, for example, a list element,
stripping the various HTML <a>, <span> etc. tags.
descendant-or-self gives us a handy way of doing this, producing a
list of the descendants' text:

list_text = li.xpath('descendant-or-self::text()').extract()
list_text
Out: [u'C\xe9sar Milstein', u', Physiology or Medicine, 1984']

We can get the full text by joining the list elements together:

' '.join(list_text)
Out: u'C\xe9sar Milstein , Physiology or Medicine, 1984'

Note that the first item of list_text is the winner's name, giving us
another way to access it if, for example, it was missing a hyperlink.
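
The same text-gathering idea can be explored outside the Scrapy shell. The sketch below is an assumption-laden stand-in: it uses ElementTree's itertext method from the standard library, which behaves much like descendant-or-self::text() here, on a hand-written copy of the list element.

```python
import xml.etree.ElementTree as ET

# Hand-written copy of the first list element
LI = ('<li><a href="/wiki/C%C3%A9sar_Milstein" '
      'title="César Milstein">César Milstein</a>'
      ', Physiology or Medicine, 1984</li>')

li = ET.fromstring(LI)

# itertext() walks the element and its descendants, yielding each text node
list_text = list(li.itertext())

spaced = ' '.join(list_text)   # mirrors the join used in the text
tight = ''.join(list_text)     # avoids the stray space before the comma
name = list_text[0]            # the winner's name is the first text node

print(list_text)
print(spaced)
print(tight)
```

Joining with a space reproduces the slightly odd "Milstein ," spacing seen in the shell output, which is harmless here because the name is later recovered by splitting on the comma and stripping white-space.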

Now that we've established the xpaths to our scraping targets (the
name and link-text of the Nobel winners) let's incorporate them into
our first Scrapy spider.

Getting the right Xpath expression for your target element(s) can be
a little tricky and those difficult edge cases can demand a complex
nest of clauses. The use of a well-written cheat-sheet can be a great
help here and thankfully there are many good xpath ones. A very nice
selection can be found here with this color-coded one being
particularly useful.

A First Scrapy Spider 

Armed with a little Xpath knowledge, let's produce our first scraper,
aiming to get the country and link-text for the winners (Figure 6-2
A and B).

Scrapy calls its scrapers spiders, each of which is a Python module
placed in the spiders directory of your project. We'll call our first
scraper nwinners_list_spider.py:

.
├── nobel_winners
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── nwinners_list_spider.py <---
└── scrapy.cfg

Spiders are sub-classed scrapy.Spider classes and any placed in the
spiders directory will be automatically detected by Scrapy and
made accessible by name to the scrapy command.

The basic Scrapy spider shown in Example 6-2 follows a pattern
you'll be using with most of your spiders. First you subclass a Scrapy
item to create fields for your scraped data (section A in
Example 6-2). You then create a named spider by subclassing
scrapy.Spider (section B in Example 6-2). The spider's name will
be used when calling scrapy from the command line. Each spider
has a parse method (section C in Example 6-2) which deals with the
HTTP responses from a list of start URLs contained in a start_urls
class attribute. In our case the start URL is the Wikipedia page for
Nobel laureates by country.


Example 6-2. A first Scrapy spider

# nwinners_list_spider.py

import scrapy
import re

# A. Define the data to be scraped
class NWinnerItem(scrapy.Item):
    country = scrapy.Field()
    name = scrapy.Field()
    link_text = scrapy.Field()

# B. Create a named spider
class NWinnerSpider(scrapy.Spider):
    """ Scrapes the country and link-text of the Nobel-winners. """

    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        "http://en.wikipedia.org ... of_Nobel_laureates_by_country"
    ]

    # C. A parse method to deal with the HTTP response
    def parse(self, response):

        h2s = response.xpath('//h2')  # ❶

        items = []
        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()')\
                        .extract()  # ❷
            if country:
                winners = h2.xpath('following-sibling::ol[1]')  # ❸
                for w in winners.xpath('li'):
                    text = w.xpath('descendant-or-self::text()')\
                            .extract()
                    items.append(NWinnerItem(
                        country=country[0], name=text[0],
                        link_text=' '.join(text)
                    ))

        return items

❶ Gets all the <h2> headers on the page, most of which will be our
target country titles.

❷ Where possible, gets the text of the <h2> element's child <span>
with class mw-headline.

❸ Gets the list of country winners.

The parse method in Example 6-2 receives the response from an
HTTP request to the Wikipedia Nobel page and is required to
return a list containing Scrapy items. These items are then converted
to a list of JSON objects which can be saved to an output file.

Let's run our first spider to make sure we're correctly parsing and
scraping our Nobel data. First navigate to the nobel_winners root
directory (containing the scrapy.cfg file) of the scraping project.
Let's see what scraping spiders are available:

$ scrapy list
nwinners_list

As expected, we have one nwinners_list spider sitting in the
spiders directory. To start it scraping we use the crawl command and
direct the output to a nobel_winners.json file. By default we will get
a lot of Python logging information accompanying the crawl:

$ scrapy crawl nwinners_list -o nobel_winners.json
2015- ... [scrapy] INFO: Scrapy 0.24.4 started (bot: nobel_winners)
2015- ... [scrapy] INFO: Optional features available: ssl, http11
2015- ... [nwinners_list] INFO: Closing spider (finished)
2015- ... [nwinners_list] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 551,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 45469,
     'item_scraped_count': 1075,  ❶
2015- ... [nwinners_list] INFO: Spider closed (finished)

❶ We scraped 1075 Nobel winners from the page.

The output of the scrapy crawl shows 1075 items successfully
scraped. Let's look at our JSON output file to make sure things have
gone according to plan:

$ head nobel_winners.json
[{"country": "Argentina",
  "link_text": "C\u00e9sar Milstein , Physiology or Medicine, 1984",
  "name": "C\u00e9sar Milstein"},
 {"country": "Argentina",
  "link_text": "Adolfo P\u00e9rez Esquivel , Peace, 1980",
  "name": "Adolfo P\u00e9rez Esquivel"},
 ...


As you can see, we have an array of JSON objects with the three key
fields (country, link_text and name) present and correct.
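
Rather than eyeballing the file, a check like this can be scripted. The following sketch uses only the standard library; SAMPLE is a hand-copied fragment standing in for the contents of nobel_winners.json (in practice you would read the file itself), and the expected keys are those defined on NWinnerItem.

```python
import json

# Hand-copied stand-in for the start of nobel_winners.json
SAMPLE = r'''[{"country": "Argentina",
  "link_text": "C\u00e9sar Milstein , Physiology or Medicine, 1984",
  "name": "C\u00e9sar Milstein"},
 {"country": "Argentina",
  "link_text": "Adolfo P\u00e9rez Esquivel , Peace, 1980",
  "name": "Adolfo P\u00e9rez Esquivel"}]'''

winners = json.loads(SAMPLE)

# Every object should carry exactly the fields defined on the item
expected_keys = {'country', 'link_text', 'name'}
bad = [w for w in winners if set(w) != expected_keys]

print(len(winners), 'objects,', len(bad), 'with unexpected keys')
print(winners[0]['name'])
```

Note that the \u00e9 escapes in the JSON are decoded back into accented characters on loading.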

Now we have a spider that successfully scrapes the list-data for all
the Nobel winners on the page, let's start refining it to grab all the
data we are targeting for our Nobel visualisation (see Example 6-1
and Figure 6-1).

First, let's add all the data we plan to scrape as fields to our
scrapy.Item:


class NWinnerItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    year = scrapy.Field()
    category = scrapy.Field()
    nationality = scrapy.Field()
    gender = scrapy.Field()
    born_in = scrapy.Field()
    date_of_birth = scrapy.Field()
    date_of_death = scrapy.Field()
    place_of_birth = scrapy.Field()
    place_of_death = scrapy.Field()
    text = scrapy.Field()

It's also sensible to simplify the code a bit and use a dedicated
function, process_winner_li, to process the winners' link-text. We'll
pass a link selector and country name to it and return a dictionary
containing the scraped data:

    def parse(self, response):
        h2s = response.xpath('//h2')
        items = []
        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()
            if country:
                winners = h2.xpath('following-sibling::ol[1]')
                for w in winners.xpath('li'):
                    wdata = process_winner_li(w, country[0])

The process_winner_li function is shown in Example 6-3.

Example 6-3. Processing a winner's list item

# ...
import re

BASE_URL = 'http://en.wikipedia.org'
# ...

def process_winner_li(w, country=None):
    """
    Process a winner's <li> tag, adding country of birth or
    nationality, as applicable.
    """
    wdata = {}
    wdata['link'] = BASE_URL + w.xpath('a/@href').extract()[0]  ➊
    text = ' '.join(w.xpath('descendant-or-self::text()')
                    .extract())
    # get comma-delineated name and strip trailing white-space
    wdata['name'] = text.split(',')[0].strip()

    year = re.findall(r'\d{4}', text)  ➋
    if year:
        wdata['year'] = int(year[0])
    else:
        wdata['year'] = 0
        print('Oops, no year in ', text)

    category = re.findall(
        'Physics|Chemistry|Physiology or Medicine|Literature|Peace|Economics',
        text)  ➌
    if category:
        wdata['category'] = category[0]
    else:
        wdata['category'] = ''
        print('Oops, no category in ', text)

    if country:
        if text.find('*') != -1:  ➍
            wdata['nationality'] = ''
            wdata['born_in'] = country
        else:
            wdata['nationality'] = country
            wdata['born_in'] = ''

    # store a copy of the link's text-string for any manual corrections
    wdata['text'] = text
    return wdata

➊ To grab the href attribute from the list-item's <a> tag (<li><a
  href=/wiki/...>[winner name]</a>...), we use the xpath attribute
  referent @.

➋ Here we use re, Python's built-in regex library, to find the
  four-digit year strings in the list-item's text.

➌ Another use of the regex library, to find the Nobel prize category
  in the text.

➍ An asterisk following the winner's name is used to indicate that
  the country is the winner's by birth, not nationality at the time of
  the prize, e.g. "William Lawrence Bragg*, Physics, 1915" in the
  list for Australia.
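To see the text-processing steps without running a spider, here is a small stand-alone sketch that applies the same regular expressions to two link-text strings. It is written for Python 3. parse_link_text and the born_in_flag key are names invented for this illustration, and unlike Example 6-3 it also strips the asterisk from the name:

```python
import re

CATEGORIES = ('Physics|Chemistry|Physiology or Medicine|'
              'Literature|Peace|Economics')


def parse_link_text(text):
    """Pull name, year, category and the birth-country flag from link text."""
    year = re.findall(r'\d{4}', text)        # all four-digit runs
    category = re.findall(CATEGORIES, text)  # first matching alternative wins
    return {
        # hypothetical extra step: drop the asterisk before stripping
        'name': text.split(',')[0].replace('*', '').strip(),
        'year': int(year[0]) if year else 0,
        'category': category[0] if category else '',
        'born_in_flag': '*' in text,
    }


print(parse_link_text('William Lawrence Bragg*, Physics, 1915'))
print(parse_link_text('César Milstein , Physiology or Medicine, 1984'))
```

Trying patterns on a handful of sample strings like this is usually quicker than re-running the crawl after every change.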


Embracing regex 

    Some people, when confronted with a problem, think "I know,
    I'll use regular expressions." Now they have two problems.

    -- Jamie Zawinski

The above quote is a hoary old classic but does sum up what many
people feel about regular expressions (regex). There is something of
the Alien Hieroglyphics about them and tales abound of recursive
patterns gone horribly wrong. But the fact is that web-scraping is
often about pattern matching messy and under-specified data and
regex is pretty much tailor-made for many of the jobs that crop up.
You can probably hack your way around them but embracing them
a little will make your life that much easier and the good news is
that a little goes a long way. See Example 6-3 for some examples.

Example 6-3 returns all the winners' data available on the main
Wikipedia Nobels by Nationality page, i.e. the name, year, category,
country (of birth or when awarded the prize) and a link to the
individual winners' pages. We'll need to use this last information to
get those biographical pages and use them to scrape our remaining
target data (see Example 6-1 and Figure 6-1).

Scraping the Individual Biography Pages 

The main Wikipedia Nobels by Nationality page gave us a lot of our
target data, but the winners' date of birth, date of death (where
applicable) and gender are still to be scraped. This information is
hopefully available, either implicitly or explicitly, on their
biography pages (for non-organization winners). Now's a good time to
fire up Chrome's Elements tab and take a look at those pages, to work
out how we're going to extract the desired data.

We saw in the last chapter that the visible information boxes on
individuals' pages are not a reliable source of information and are
often missing entirely. Until recently², a hidden persondata table
(see Figure 6-4) gave fairly reliable access to such information as
place-of-birth, date-of-death etc. Unfortunately this handy resource
has been deprecated³. The good news is that this is part of an
attempt to improve the categorisation of biographical information
by giving it a dedicated space in Wikidata, Wikipedia's central
storage for its structured data.


2 The author got stung by this removal... 

3 See here for an insight into Wikipedia dispute management.

[Figure: page source showing the hidden table with id "persondata",
made up of label and value cells such as Name: Röntgen, Wilhelm]

Figure 6-4. A Nobel winner's hidden persondata table


Examining Wikipedia's biography pages with Chrome's Elements tab
shows a link to the relevant Wikidata item (see Figure 6-5), which
takes you to the biographical data held at wikidata.org. By following
this link we can scrape whatever we find there, hopefully the bulk of
our target data: significant dates and places (see Example 6-1).


[Figure: Chrome's Elements tab on a biography page, highlighting the
"Wikidata item" entry in the sidebar tools list, an <a> tag with
href //www.wikidata.org/wiki/Q155525]

Figure 6-5. Hyperlink to the winner's Wikidata
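The href in Figure 6-5 is protocol-relative: it starts with // and carries no http: or https: prefix, so it needs a scheme before it can be requested. One option is to prepend the scheme by hand. Another, sketched here with Python 3's standard library, is to resolve it against the address of the page it was found on. The biography address below is only an example value:

```python
from urllib.parse import urljoin

# Page the link was scraped from (example value).
bio_page = 'https://en.wikipedia.org/wiki/C%C3%A9sar_Milstein'
# Protocol-relative href, as shown in Figure 6-5.
wikidata_href = '//www.wikidata.org/wiki/Q155525'

# urljoin borrows the scheme from the base URL.
full_url = urljoin(bio_page, wikidata_href)
print(full_url)  # https://www.wikidata.org/wiki/Q155525
```

Resolving against the source page has the small advantage that ordinary relative links (such as /wiki/...) are handled by the same call.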


Following the link to Wikidata shows a page containing fields for
the data we are looking for, for example the date-of-birth of our
prize winner. As Figure 6-6 shows, the properties are embedded in a
nest of computer-generated HTML, with related codes, which we
can use as a scraping identifier, e.g. date-of-birth has the code P569.


190 I Chapter 6: Heavyweight Scraping with Scrapy 



















[Figure: Chrome's Elements tab on a Wikidata page. Each property sits
in a div of class wikibase-statementgroupview whose id is the property
code, e.g. P569 (date of birth) and P570 (date of death)]

Figure 6-6. Biographical properties at Wikidata


As Figure 6-7 shows, the actual data we want, in this case a
date-string, is contained in a further nested branch of HTML, within
its respective property tag. By selecting the div and right-clicking
we can store the element's xpath and use that to tell Scrapy how to
get the data it contains.


[Figure: Chrome's Elements tab showing the date-string "8 October
1927" nested inside divs of class wikibase-snakview-value-container,
with the right-click context menu open on the element]

Figure 6-7. Getting the xpath for a Wikidata property


Now we have the xpaths necessary to find our scraping targets, let's
put it all together and see how Scrapy chains requests, allowing for
complex, multi-page scraping operations.

Chaining Requests and Yielding Data

In this section we'll see how to chain Scrapy requests, allowing us to
follow hyperlinks, scraping data as we go. First let's enable Scrapy's
page-caching. While experimenting with xpath targets etc. we want
to limit the number of calls to Wikipedia, and it's good manners to
store the pages we have fetched. Unlike some data-sets out there, our
set of Nobel prize-winners changes but once a year².

Caching our pages 

As you might expect, Scrapy has a sophisticated caching system that
gives you fine-grained control over your page-caching, e.g. allowing
you to choose between database or filesystem storage backends, how
long before your pages are expired etc. It is implemented as
middleware enabled in our project's settings.py module. There are
various options available but for the purposes of our Nobel scraping,
simply setting HTTPCACHE_ENABLED to True will suffice:

# coding: utf-8

# Scrapy settings for nobel_winners project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'nobel_winners'

SPIDER_MODULES = ['nobel_winners.spiders']
NEWSPIDER_MODULE = 'nobel_winners.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'nobel_winners (+http://www.yourdomain.com)'

HTTPCACHE_ENABLED = True
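If the defaults ever need adjusting, the cache has a handful of companion settings. The fragment below is a sketch of how they might look in settings.py. The values are illustrative choices rather than recommendations, and none of them is needed for the Nobel scraping:

```python
# settings.py (fragment): optional HTTP cache tuning, illustrative values
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached pages never expire
HTTPCACHE_DIR = 'httpcache'     # relative paths sit under the project's data dir
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]  # don't cache these error responses
```

With expiration set to 0, clearing the cache directory is the way to force a fresh fetch, for example after the next round of prizes.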

Check out the full range of Scrapy middleware here. 

Having ticked the caching box, let's see how to chain Scrapy requests.


2 Strictly speaking there are edits being made continually by the Wikipedia community
but the fundamental details should be stable until the next set of prizes.

Yielding requests

Our existing spider's parse method cycles through the Nobel winners,
using the process_winner_li function to scrape the country, name,
year, category and biography-hyperlink fields. We now want to use
the biography hyperlinks to generate a Scrapy request which will
fetch the bio-pages and send them to a custom method for scraping.

Scrapy implements a Pythonic pattern for chaining requests, using
Python's yield statement to create a generator², allowing Scrapy to
easily consume any extra page-requests we make. Example 6-4
shows the pattern in action.

Example 6-4. Yielding a request with Scrapy

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_full'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        "http://en.wikipedia.org/wiki..._Nobel_laureates_by_country"
    ]

    def parse(self, response):
        filename = response.url.split('/')[-1]
        h2s = response.xpath('//h2')
        items = []

        for h2 in list(h2s)[:2]:
            country = h2.xpath('span[@class="mw-headline"]/text()')\
                      .extract()
            if country:
                winners = h2.xpath('following-sibling::ol[1]')
                for w in winners.xpath('li'):
                    wdata = process_winner_li(w, country[0])
                    request = scrapy.Request(  ➊
                        wdata['link'],
                        callback=self.parse_bio,  ➋
                        dont_filter=True)
                    request.meta['item'] = NWinnerItem(**wdata)  ➌
                    yield request  ➍

    def parse_bio(self, response):
        item = response.meta['item']  ➎

2 See here for a nice run-down of Python generators and the use of yield.

➊ Makes a request to the winner's biography page, using the link
  (wdata['link']) scraped by process_winner_li.

➋ Sets the callback function to handle the response.

➌ Creates a Scrapy Item to hold our Nobel-data and initialises it
  with the data just scraped by process_winner_li. This Item-data
  is attached to the meta-data of the request to allow any
  response access to it.

➍ By yielding the request, we make the parse method a generator
  of consumable requests.

➎ This method handles the callback from our bio-link request. In
  order to add scraped data to our Scrapy Item, we first retrieve it
  from the response meta-data.
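The generator pattern can be hard to picture while the real engine is doing the consuming, so here is a toy model of it in plain Python 3. Nothing in it is Scrapy code: FakeRequest, crawl and the example.org address are stand-ins invented for this sketch, and the callback receives the request itself rather than a downloaded response:

```python
from collections import deque


class FakeRequest:
    """Stand-in for scrapy.Request: a URL, a callback and a meta dict."""
    def __init__(self, url, callback):
        self.url, self.callback, self.meta = url, callback, {}


def parse(url):
    # First link in the chain: one request per (pretend) winner link.
    for name in ('A', 'B'):
        request = FakeRequest(url + '/' + name, parse_bio)
        request.meta['item'] = {'name': name}
        yield request


def parse_bio(request):
    # Last link in the chain: yield the finished item, not a request.
    item = request.meta['item']
    item['bio_url'] = request.url
    yield item


def crawl(start_url):
    """Toy scheduler: follow yielded requests, collect yielded items."""
    items, queue = [], deque(parse(start_url))
    while queue:
        result = queue.popleft()
        if isinstance(result, FakeRequest):
            queue.extend(result.callback(result))  # "fetch", then call back
        else:
            items.append(result)
    return items


print(crawl('http://example.org/list'))
```

The point to take away is that a callback can yield either more requests or finished items, and whatever consumes the generator decides what to do with each.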

Our investigation of the Wikipedia pages in "Scraping the Individual
Biography Pages" on page 189 showed that we need to locate a
winner's Wikidata link from their biography page and use it to
generate a request. We will then scrape the date, place, and gender
data from the response.

Example 6-5 shows parse_bio and parse_wikidata, the two methods
used to scrape our winners' biographical data. parse_bio uses the
scraped Wikidata link to request the Wikidata page, yielding the
request as it in turn was yielded in the parse method. At the end of
the request chain, parse_wikidata retrieves the item and fills in any
of the fields available from Wikidata, eventually yielding the item to
Scrapy.

Example 6-5. Parsing the winners' biography data

# ...

    def parse_bio(self, response):
        item = response.meta['item']
        href = response.xpath(
            "//li[@id='t-wikibase']/a/@href").extract()  ➊
        if href:
            request = scrapy.Request('https:' + href[0],
                                     callback=self.parse_wikidata,  ➋
                                     dont_filter=True)
            request.meta['item'] = item
            yield request

    def parse_wikidata(self, response):
        item = response.meta['item']
        property_codes = [  ➌
            {'name': 'date_of_birth', 'code': 'P569'},
            {'name': 'date_of_death', 'code': 'P570'},
            {'name': 'place_of_birth', 'code': 'P19', 'link': True},
            {'name': 'place_of_death', 'code': 'P20', 'link': True},
            {'name': 'gender', 'code': 'P21', 'link': True}
        ]

        p_template = '//*[@id="%(code)s"]/div[2]/div/div/div[2]' \
                     '/div[1]/div/div[2]/div[2]'  ➍

        for prop in property_codes:
            extra_html = ''
            if prop.get('link'):  # property string in <a> tag
                extra_html = '/a'
            sel = response.xpath(
                p_template % prop + extra_html + '/text()')
            if sel:
                item[prop['name']] = sel[0].extract()
        yield item  ➎

➊ Extracts the link to Wikidata identified in Figure 6-5.

➋ Uses the Wikidata link to generate a request with our spider's
  parse_wikidata as a callback to deal with the response.

➌ These are the property codes we found earlier (see Figure 6-6),
  with names corresponding to fields in our Scrapy Item,
  NWinnerItem. Those with a True link attribute are contained in
  <a> tags.

➍ The nasty, nested xpath for the Wikidata properties used to
  create this template comes straight from Chrome's Elements tab
  (see Figure 6-7).

➎ Finally we yield the item, which at this point should have all the
  target data available from Wikipedia.
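The xpath template relies on Python's dictionary-based string formatting: %(code)s is replaced by the value stored under the 'code' key, and any other keys in the dictionary are ignored. The sketch below (Python 3) isolates that step for two of the properties. build_xpath is a helper name made up for the illustration:

```python
p_template = ('//*[@id="%(code)s"]/div[2]/div/div/div[2]'
              '/div[1]/div/div[2]/div[2]')

property_codes = [
    {'name': 'date_of_birth', 'code': 'P569'},
    {'name': 'gender', 'code': 'P21', 'link': True},
]


def build_xpath(prop):
    """Fill in the property code and add '/a' for linked values."""
    extra_html = '/a' if prop.get('link') else ''
    # % binds tighter than +, so the template is filled in first
    return p_template % prop + extra_html + '/text()'


for prop in property_codes:
    print(prop['name'], '->', build_xpath(prop))
```

Printing the generated xpaths like this makes it easy to paste one into the browser console and check it against the live page when the site's markup changes.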

With our request chain in place, let's check that the spider is
scraping our required data:

$ scrapy crawl nwinners_full
2015-... [scrapy] ... started (bot: nobel_winners)
2015-... [nwinners_full] DEBUG: Scraped from
         <200 https://www.wikidata.org/wiki/Q155525>
        {'born_in': '',
         'category': u'Physiology or Medicine',
         'date_of_birth': u'8 October 1927',
         'date_of_death': u'24 March 2002',
         'gender': u'male',
         'link': u'http://en.wikipedia.org/wiki/C%C3%A9sar_Milstein',
         'name': u'C\xe9sar Milstein',
         'nationality': u'Argentina',
         'place_of_birth': u'Bah\xeda Blanca',
         'place_of_death': u'Cambridge',
         'text': u'C\xe9sar Milstein , Physiology or Medicine, 1984',
         'year': 1984}
2015-... [nwinners_full] DEBUG: Scraped from
         <200 https://www.wikidata.org/wiki/Q193672>
        {'born_in': '',
         'category': u'Peace',
         'date_of_birth': u'1 November 1878',
         'date_of_death': u'5 May 1959',
         'gender': u'male',
         'link': u'http://en.wikipedia.org/wiki/Carlos_Saavedra_Lamas',


Things are looking good. With the exception of the born_in field,
which is dependent on a name in the main Wikipedia Nobel winners
list having an asterisk, we're getting all the data we were targeting.
This data-set is now ready to be cleaned by Pandas in the coming
chapter.

Now that we've scraped our basic biographical data for the Nobel
winners, let's go scrape our remaining targets: some biographical
body-text and a picture of the great man or woman, where available.

Scrapy Pipelines 

In order to add a little personality to our Nobel visualisation it
would be good to have a little biographical text and an image of the
winner. Wikipedia's biographical pages generally provide these
things so let's go about scraping them.

Up to now our scraped data has been text strings. In order to scrape
images, in their various formats, we need to use a Scrapy pipeline.
Pipelines provide a way of post-processing the items we have scraped
and you can define any number of them. You can write your own or
take advantage of those already provided by Scrapy, such as the
ImagesPipeline we'll be using.

In its simplest form, a pipeline need only define a process_item
method. This receives the scraped items and the spider object. Let's
write a little pipeline to reject genderless Nobel winners (so we can
omit prizes given to organizations rather than individuals) using our
existing nwinners_full spider to deliver the items. First we add a
DropNonPersons pipeline to the pipelines.py module of our
project:

# nobel_winners/nobel_winners/pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class DropNonPersons(object):
    """ Remove non-person winners """

    def process_item(self, item, spider):
        # .get avoids a KeyError when no gender was ever set on the item
        if not item.get('gender'):  ➊
            raise DropItem("No gender for %s" % item['name'])
        return item  ➋

➊ If our scraped item failed to find a gender property at Wikidata
  it is probably an organization such as the Red Cross. Our
  visualisation is focused on individual winners so here we use
  DropItem to remove the item from our output stream.

➋ We need to return the item to further pipelines or for saving by
  Scrapy.
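To make the return-or-raise contract concrete, here is a self-contained model of how a chain of pipelines treats each item. It is a Python 3 sketch only: DropItem is redefined locally so the code runs without Scrapy, run_pipelines imitates what the framework does for us, and TitleCaseGender is a hypothetical second pipeline added purely to show chaining:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs alone."""


class DropNonPersons:
    def process_item(self, item, spider):
        if not item.get('gender'):
            raise DropItem('No gender for %s' % item['name'])
        return item


class TitleCaseGender:
    """Hypothetical second pipeline, only here to show chaining."""
    def process_item(self, item, spider):
        item['gender'] = item['gender'].title()
        return item


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through the pipelines in order, skipping dropped ones."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue  # a dropped item never reaches later pipelines
        kept.append(item)
    return kept


scraped = [{'name': 'César Milstein', 'gender': 'male'},
           {'name': 'Red Cross'}]
print(run_pipelines(scraped, [DropNonPersons(), TitleCaseGender()]))
```

Because each pipeline receives whatever the previous one returned, forgetting the return statement would hand None to the next stage, which is why the final return in process_item matters.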

As mentioned in the pipelines.py header, in order to add this
pipeline to the spiders of our project we need to register it in the
settings.py module by adding it to a dict of pipelines and giving it
an integer value (here 1). The integer sets the pipeline's place in
the running order, with lower values run first:

# nobel_winners/nobel_winners/settings.py

BOT_NAME = 'nobel_winners'

SPIDER_MODULES = ['nobel_winners.spiders']
NEWSPIDER_MODULE = 'nobel_winners.spiders'

HTTPCACHE_ENABLED = True

ITEM_PIPELINES = {'nobel_winners.pipelines.DropNonPersons': 1}

Now we've got the basic workflow for our pipelines, let's add a useful
one to our project.

Scraping Text and Images with a Pipeline 

We now want to scrape the winners’ biography and photos (see 
Figure 6-1), where available. The biographical text can be scraped 
using the same method as our last spider but the photos are best 
dealt with by an image pipeline. 

We could easily write our own pipeline to take a scraped image
URL, request it from Wikipedia, and save it to disk, but doing it
properly requires a bit of care. For example, we would like to
avoid reloading an image that was recently downloaded or hasn't
changed in the meantime. Some flexibility in specifying where to
store the images is a useful feature. It would also be good to have the
option of converting the images into a common format (e.g. JPG or
PNG) or of generating thumbnails. Luckily, Scrapy provides an
ImagesPipeline object with all this functionality and more. This is
one of its media pipelines, which include a FilesPipeline for dealing
with general files.

We could add the image and biography-text scraping to our existing
nwinners_full spider but that's starting to get a little large and
segregating this character data from the more formal categories makes
sense. So we'll create a new spider, called nwinners_minibio, which
will reuse parts of the previous spider's parse method in order to
loop through the Nobel winners.

As usual when creating a Scrapy spider, our first job is to get the
xpaths for our scraping targets. In this case, where available, that's
the first part of the winners' biographical text and a photograph of
them. To do this we fire up Chrome Elements and explore the
HTML source of the biography pages looking for the targets shown
in Figure 6-8.

Figure 6-8. The target elements for our biography scraping: the first
part of the biography (A), marked by a stop-point (B), and the winner's
photograph (C)


Investigating with Chrome Elements shows the biographical text
(Figure 6-8 A) is contained in the first paragraphs of the <div> with
id mw-content-text, captured by the xpath
//*[@id="mw-content-text"]/p. There is an empty paragraph which
signals the stop-point (Figure 6-8 B) of the first section of the
biography:

<div id="mw-content-text">
  <p>...</p>
  <p>...</p>
  <p></p> <- stop-point
</div>

The exploration shows that the photos (Figure 6-8 C) are contained
in a table of class infobox, being the only image in that table:

<table class="infobox vcard">
  <img alt="Francis Crick crop.jpg" src="//upload... />
</table>

The xpath //table[contains(@class,"infobox")]//img/@src will
get the source address of the image.

As with our first spider, we first need to declare a Scrapy Item to
hold our scraped data. We'll scrape the bio-link and name of the
winner, which we can use as identifiers for the image and text. We
also need somewhere to store our image-urls (although we will only
scrape one bio-image, I'll cover the multiple-image use-case), the
resultant image's reference (a file-path), and a bio_image field to
store the particular image we're interested in:

import scrapy
import re

BASE_URL = 'http://en.wikipedia.org'


class NWinnerItem(scrapy.Item):
    link = scrapy.Field()
    name = scrapy.Field()
    mini_bio = scrapy.Field()
    image_urls = scrapy.Field()
    bio_image = scrapy.Field()
    images = scrapy.Field()


Now we reuse the scraping loop over our Nobel winners (see
Example 6-4 for details), this time yielding a request to our new
get_mini_bio method, which will scrape the image-urls and bio-text:


class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_minibio'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        "http://en.wikipedia.org/...of_Nobel_laureates_by_country"
    ]

    def parse(self, response):
        filename = response.url.split('/')[-1]
        h2s = response.xpath('//h2')
        items = []

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()')\
                      .extract()
            if country:
                winners = h2.xpath('following-sibling::ol[1]')
                for w in winners.xpath('li'):
                    wdata = {}
                    wdata['link'] = BASE_URL + w.xpath('a/@href')\
                                    .extract()[0]
                    # Process the winner's bio-page with get_mini_bio
                    request = scrapy.Request(wdata['link'],
                                             callback=self.get_mini_bio)
                    request.meta['item'] = NWinnerItem(**wdata)
                    yield request

Our get_mini_bio method will add any available photo URLs to the
image_urls list and add all paragraphs of the biography up to the
<p></p> stop-point to the item's mini_bio field:

    def get_mini_bio(self, response):
        """ Get the winner's bio-text and photo """

        BASE_URL_ESCAPED = r'http:\/\/en.wikipedia.org'
        item = response.meta['item']
        item['image_urls'] = []
        img_src = response.xpath(
            '//table[contains(@class,"infobox")]//img/@src')  ➊
        if img_src:
            item['image_urls'] = ['http:' + img_src[0].extract()]
        mini_bio = ''
        paras = response.xpath(
            '//*[@id="mw-content-text"]/p[text() or '
            'normalize-space(.)=""]').extract()  ➋

        for p in paras:
            if p == '<p></p>':  # the bio-intro's stop-point  ➌
                break
            mini_bio += p
        # correct for wiki-links
        mini_bio = mini_bio.replace('href="/wiki', 'href="'
                                    + BASE_URL + '/wiki')  ➍
        # keep the href=" prefix when expanding in-page anchors
        mini_bio = mini_bio.replace('href="#',
                                    'href="' + item['link'] + '#')
        item['mini_bio'] = mini_bio
        yield item

➊ Targets the first (and only) image in the table of class infobox
  and gets its source (src) attribute, e.g.
  <img src=//upload.wikimedia.org/.../Max_Perutz.jpg...

➋ This xpath gets all the paragraphs in the <div> with id
  mw-content-text. If a paragraph is empty (text() == False)
  then the normalize-space(.) command is used to force the
  contents of the paragraph (. represents the p-node in question)
  to an empty string. This is to make sure any empty paragraph
  matches the stop-point marking the end of the intro-section of
  the biography.

➌ Iterates through the available paragraphs, breaking on the
  empty paragraph stop-point.

➍ Replaces Wikipedia's internal hrefs (e.g. /wiki/...) with the full
  addresses our visualisation will need.
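The paragraph-joining and link-fixing logic is plain string handling, so it can be tried out on its own. The sketch below (Python 3) uses a hypothetical helper, build_mini_bio, and four made-up paragraphs. Note that it writes href=" back in when expanding in-page anchors, so the attribute stays intact:

```python
BASE_URL = 'http://en.wikipedia.org'


def build_mini_bio(paras, page_link):
    """Join paragraphs up to the empty stop-point and absolutise links."""
    mini_bio = ''
    for p in paras:
        if p == '<p></p>':  # the stop-point ends the intro
            break
        mini_bio += p
    # site-relative links: /wiki/... becomes http://en.wikipedia.org/wiki/...
    mini_bio = mini_bio.replace('href="/wiki', 'href="' + BASE_URL + '/wiki')
    # in-page anchors: #Section becomes <page address>#Section
    mini_bio = mini_bio.replace('href="#', 'href="' + page_link + '#')
    return mini_bio


# made-up paragraphs, for illustration only
paras = ['<p>Born in <a href="/wiki/Belgium">Belgium</a>.</p>',
         '<p>See <a href="#Awards">awards</a>.</p>',
         '<p></p>',
         '<p>Early life, not wanted in the intro.</p>']
print(build_mini_bio(paras, BASE_URL + '/wiki/Albert_Claude'))
```

Only the first two paragraphs survive, and both links come out as full addresses that will work outside Wikipedia.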

With our bio-scraping spider defined, we need to create its
complementary pipeline, which will take the image-URLs scraped and
convert them into saved images. We'll use Scrapy's images-pipeline
for this job.

The ImagesPipeline shown in Example 6-6 has two main methods:
get_media_requests, which generates the requests for the
image-URLs, and item_completed, called after the requests have been
consumed.

Example 6-6. Scraping images with the image-pipeline

import scrapy
# (this module lives at scrapy.pipelines.images in Scrapy 1.0 and later)
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem


class NobelImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):  ➊
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):  ➋
        image_paths = [x['path'] for ok, x in results if ok]  ➌
        if image_paths:
            item['bio_image'] = image_paths[0]
        return item

➊ This takes any image-URLs scraped by our nwinners_minibio
  spider and generates an HTTP request for their content.

➋ After the image-URL requests have been made, the results are
  delivered to the item_completed method.

➌ This Python list-comprehension filters the list of result tuples
  (of form [(True, Image), (False, Image) ...]) for those that were
  successful and stores their file-paths relative to the directory
  specified by the IMAGES_STORE variable in settings.py.
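The filtering step in item_completed is easy to try on hand-made data. In this Python 3 sketch the results list is fabricated to match the shape described above, with a plain Exception standing in for whatever failure object a real download error would carry, and the paths are placeholders rather than real hashes:

```python
# Made-up results in the (success flag, info) shape described above.
results = [
    (True, {'url': 'http://example.org/a.jpg', 'path': 'full/aaa.jpg'}),
    (False, Exception('download failed')),
    (True, {'url': 'http://example.org/b.jpg', 'path': 'full/bbb.jpg'}),
]

# Keep the stored path of every successful download, in order.
image_paths = [x['path'] for ok, x in results if ok]
# The first successful image becomes the bio image, if there is one.
bio_image = image_paths[0] if image_paths else None

print(image_paths)  # ['full/aaa.jpg', 'full/bbb.jpg']
print(bio_image)    # full/aaa.jpg
```

The "if ok" guard matters because the second element of a failed tuple is not a dictionary, so indexing it with 'path' would raise an error.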

Now that we have spider and pipeline defined, we just need to add 
the pipeline to our settings.py module and set the IMAGES_STORE 
variable to the directory we want to save the images in: 

# nobel_winners/nobel_winners/settings.py

ITEM_PIPELINES = {'nobel_winners.pipelines.NobelImagesPipeline': 1}
IMAGES_STORE = 'images'

Let's run our new spider from the nobel_winners root directory of
our project and check its output:

$ scrapy crawl nwinners_minibio -o minibios.json
2015-... DEBUG: Scraped from <200 http:.../Albert_Claude>
{'image_urls': [],
 'link': u'http://en.wikipedia.org/wiki/Albert_Claude',
 'mini_bio': u'<p><b>Albert Claude</b> (24 August 1899 \u2013...
  <a href="http://en.wikipedia.org/wiki/Belgium"...>Belgian</a>
  <a href="http://en.wikipedia.org/wiki/Medical_doctor"...>
2015-... DEBUG: Scraped from <200 http:.../Brian_P._Schmidt>
{'bio_image': 'full/a5f763b828006e704cb291411b8b643bfb91886c.jpg',
 'image_urls': [u'http://upload.wiki...220px-Brian_Schmidt.jpg'],
 'link': u'http://en.wikipedia.org/wiki/Brian_P._Schmidt',
 'mini_bio': u'<p><b>Brian Paul Schmidt</b>...

We can see that scraping Albert Claude's biography page failed to
turn up an image (a quick trip to Wikipedia confirms it's missing)
but Brian Schmidt's page came up trumps. The image URL was stored in
image_urls and successfully processed, giving the JPG file stored in
the images directory we specified with IMAGES_STORE, with a relative
path (full/a5f763b828006e704cb291411b8b643bfb91886c.jpg).
The file-name is, conveniently enough, a SHA1 hash of the image's
URL, which allows the image pipeline to check for existing images
and so avoid redundant requests.
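The naming scheme is simple enough to reproduce with Python's hashlib. The following Python 3 sketch mimics the 'full/<hash>.jpg' pattern described above. image_path_for is a name invented for the illustration and the URL is a made-up example, so the printed hash will not match any file in our images directory:

```python
import hashlib


def image_path_for(url):
    """Mimic the naming described above: 'full/<sha1 of URL>.jpg'."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % digest


# hypothetical URL, for illustration only
path = image_path_for('http://example.org/images/220px-Example.jpg')
print(path)
```

Because the same URL always hashes to the same 40-character name, checking whether an image has already been fetched comes down to checking whether that file exists.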

A quick listing of our images directory shows a nice array of
Wikipedia Nobel-winner images, ready to be used in our
web-visualisation:

$ (nobel_winners) tree images
images
└── full
    ├── 0512ae11141584da1262661992a1b05dfb20dd52.jpg
    ├── 092a92689118c16b15b1613751af422439df2850.jpg
    ├── 0b6a8ca56e6ff115b7d30087df9c21da09684db1.jpg
    ├── 1197aa95299a1fec983b3dbdeaeb97a1f7e545c9.jpg
    ├── 1f6fb8e9e2241733da47328291b25bd1a78fa588.jpg
    ├── 272cf1b089c7a28ea0109ad8655bc3ef1c03fb52.jpg
    ├── 28dcc7978d9d5710f0c29d6dfcf09caa7e13a1d0.jpg

As we'll see later, we will be placing these in the static folder of
our web-app, ready to be accessed using the winner's bio_image
field.

With our images and biography text to hand we've successfully scraped
all the targets we set ourselves at the beginning of the chapter
(see Example 6-1 and Figure 6-1). Time for a quick summary before
moving on to clean this inevitably dirty data with help from Pandas.

Summary 

In this chapter we produced two Scrapy spiders which managed to
grab the simple statistical data-set of our Nobel winners plus some
biographical text and, where available, a photograph, to add some
color to the stats. Scrapy is a powerful library which takes care of
everything you could need in a full-fledged scraper. Although the
work-flow requires a little more effort to implement than some
hacking with BeautifulSoup, Scrapy has far more power available
and for any ambitious scraping really is a no-brainer. All Scrapy
spiders follow a standard recipe, which you should now know, and
after a while rolling one out for standard scraping tasks is a breeze.

I hope this chapter has conveyed the rather hacky, iterative nature of
scraping and some of the quiet satisfaction that can be had from
producing relatively clean data from the unpromising mound of
stuff so often found on the web. The fact is that now and for the
foreseeable future the large majority of interesting data, the fuel to
the art/science of data-visualisation, is trapped in a form unusable
for the web-based visualisations that are the focus of this book.
Scraping is, in this sense, an emancipating endeavour.

The data we scraped, much of it human-edited, will certainly have
some errors, from badly formatted dates to categorical anomalies to
missing fields. Making that data presentable is the focus of the
next, Pandas-based, chapters. But first we need a little introduction
to Pandas and its building block NumPy.